Communicated by Christopher Bishop
Flat Minima

Sepp Hochreiter
Fakultät für Informatik, Technische Universität München, 80290 München, Germany
Jürgen Schmidhuber
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland

Neural Computation 9, 1–42 (1997)
We present a new algorithm for finding low-complexity neural networks with high generalization capability. The algorithm searches for a "flat" minimum of the error function: a large connected region in weight space where the error remains approximately constant. An MDL-based, Bayesian argument suggests that flat minima correspond to "simple" networks and low expected overfitting. The argument is based on a Gibbs algorithm variant and a novel way of splitting generalization error into underfitting and overfitting error. Unlike many previous approaches, ours does not require gaussian assumptions and does not depend on a "good" weight prior. Instead we have a prior over input-output functions, thus taking into account net architecture and training set. Although our algorithm requires the computation of second-order derivatives, it has backpropagation's order of complexity. Automatically, it effectively prunes units, weights, and input lines. Various experiments with feedforward and recurrent nets are described. In an application to stock market prediction, flat minimum search outperforms conventional backprop, weight decay, and "optimal brain surgeon/optimal brain damage."

1 Basic Ideas and Outline

Our algorithm tries to find a large region in weight space with the property that each weight vector from that region leads to similar small error. Such a region is called a flat minimum (Hochreiter and Schmidhuber 1995). To get an intuitive feeling for why a flat minimum is interesting, consider this: a sharp minimum (see Fig. 2) corresponds to weights that have to be specified with high precision. A flat minimum (see Fig. 1) corresponds to weights, many of which can be given with low precision. In the terminology of the theory of minimum description (message) length (MML, Wallace and Boulton 1968; MDL, Rissanen 1978), fewer bits of information are required to describe a flat minimum (corresponding to a "simple" or low-complexity network). The MDL principle suggests that low network complexity corresponds to high generalization performance. Similarly, the standard Bayesian
Figure 1: Example of a flat minimum.
Figure 2: Example of a sharp minimum.
view favors “fat” maxima of the posterior weight distribution (maxima with a lot of probability mass; see, e.g., Buntine and Weigend 1991). We will see that flat minima are fat maxima. Unlike, e.g., Hinton and van Camp’s method (1993), our algorithm does not depend on the choice of a “good” weight prior. It finds a flat minimum by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using an efficient second-order method (Pearlmutter 1994; Møller 1993), we obtain conventional backpropagation’s order of computational complexity. Automatically, the method effectively reduces numbers of units, weights, and
input lines, as well as output sensitivity with respect to remaining weights and units. Unlike simple weight decay, our method automatically treats and prunes units and weights in different layers in different reasonable ways.

The article is organized as follows:

• Section 2 formally introduces basic concepts, such as error measures and flat minima.
• Section 3 describes the novel algorithm, flat minimum search (FMS).
• Section 4 formally derives the algorithm.
• Section 5 reports experimental generalization results with feedforward and recurrent networks. For instance, in an application to stock market prediction, flat minimum search outperforms the following widely used competitors: conventional backpropagation, weight decay, and "optimal brain surgeon/optimal brain damage."
• Section 6 mentions relations to previous work.
• Section 7 mentions limitations of the algorithm and outlines future work.
• The Appendix presents a detailed theoretical justification of our approach. Using a variant of the Gibbs algorithm, Section A.1 defines generalization, underfitting, and overfitting error in a novel way. By defining an appropriate prior over input-output functions, we postulate that the most probable network is a flat one. Section A.2 formally justifies the error function minimized by our algorithm. Section A.3 shows how to compute derivatives required by the algorithm.

2 Task, Architecture, and Boxes

2.1 Generalization Task. The task is to approximate an unknown function f ⊂ X × Y mapping a finite set of possible inputs X ⊂ R^N to a finite set of possible outputs Y ⊂ R^K. A data set D is obtained from f (see Section A.1). All training information is given by a finite set D0 ⊂ D. D0 is called the training set. The pth element of D0 is the input-target pair (x_p, y_p).

2.2 Architecture and Net Functions. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has N input units, K output units, L weights, and differentiable activation functions. It maps input vectors x ∈ R^N to output vectors o(w, x) ∈ R^K, where w is the L-dimensional weight vector, and the weight on the connection from unit j to unit i is denoted w_ij. The net function
induced by w is denoted net(w): for x ∈ R^N, net(w)(x) = o(w, x) = (o^1(w, x), o^2(w, x), ..., o^K(w, x)), where o^i(w, x) denotes the ith component of o(w, x), corresponding to output unit i.

2.3 Training Error. We use squared error

    E(net(w), D0) := Σ_{(x_p, y_p) ∈ D0} ||y_p − o(w, x_p)||²,

where || · || denotes the Euclidean norm.

2.4 Tolerable Error. To define a region in weight space with the property that each weight vector from that region leads to small error and similar output, we introduce the tolerable error Etol, a positive constant (see Section A.1 for a formal definition of Etol). "Small" error is defined as being smaller than Etol. E(net(w), D0) > Etol implies underfitting.

2.5 Boxes. Each weight w satisfying E(net(w), D0) ≤ Etol defines an acceptable minimum (compare M(D0) in Section A.1). We are interested in a large region of connected acceptable minima, where each weight w within this region leads to almost identical net functions net(w). Such a region is called a flat minimum. We will see that flat minima correspond to low expected generalization error. To simplify the algorithm for finding a large connected region (see below), we do not consider maximal connected regions but focus on so-called boxes within regions: for each acceptable minimum w, its box M_w in weight space is an L-dimensional hypercuboid with center w. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in the direction of the axis corresponding to weight w_ij is denoted by Δw_ij(X). The Δw_ij(X) are the maximal (positive) values such that for all L-dimensional vectors κ whose components κ_ij are restricted by |κ_ij| ≤ Δw_ij(X), we have E(net(w), net(w + κ), X) ≤ ε, where

    E(net(w), net(w + κ), X) = Σ_{x ∈ X} ||o(w, x) − o(w + κ, x)||²,

and ε is a small positive constant defining tolerable output changes (see also equation 3.1). Note that Δw_ij(X) depends on ε. Since our algorithm does not use ε explicitly, however, it is notationally suppressed. Δw_ij(X) gives the precision of w_ij. M_w's box volume is defined by V(Δw(X)) := 2^L Π_{i,j} Δw_ij(X), where Δw(X) denotes the vector with components Δw_ij(X). Our goal is to find large boxes within flat minima.

3 The Algorithm

Let X0 = {x_p | (x_p, y_p) ∈ D0} denote the inputs of the training set. We approximate Δw(X) by Δw(X0), where Δw(X0) is defined like Δw(X) in Section 2.5 (replacing X by X0). For simplicity, we will abbreviate Δw(X0) as Δw. Starting with a random initial weight vector, flat minimum search (FMS) tries to find a w that not only has low E(net(w), D0) but also defines a box M_w with maximal box volume V(Δw) and, consequently, minimal

    B̃(w, X0) := −log( (1/2^L) V(Δw) ) = Σ_{i,j} −log Δw_ij.

Note the relationship to
MDL: B̃ is the number of bits required to describe the weights, whereas the number of bits needed to describe the y_p, given w (with (x_p, y_p) ∈ D0), can be bounded by fixing Etol (see Section A.1). In the next section we derive the following algorithm. We use gradient descent to minimize

    E(w, D0) = E(net(w), D0) + λB(w, X0),   where   B(w, X0) = Σ_{x_p ∈ X0} B(w, x_p),   and
    B(w, x_p) = −L log ε + (1/2) Σ_{i,j} log Σ_k (∂o^k(w, x_p)/∂w_ij)²
                + L log Σ_k ( Σ_{i,j} |∂o^k(w, x_p)/∂w_ij| / sqrt( Σ_k (∂o^k(w, x_p)/∂w_ij)² ) )².   (3.1)
Here o^k(w, x_p) is the activation of the kth output unit (given weight vector w and input x_p), ε is a constant, and λ is the regularization constant (or hyperparameter) that controls the trade-off between regularization and training error (see Section A.1). To minimize B(w, X0), for each x_p ∈ X0 we have to compute

    ∂B(w, x_p)/∂w_uv = Σ_{k,i,j} [ ∂B(w, x_p) / ∂(∂o^k(w, x_p)/∂w_ij) ] · [ ∂²o^k(w, x_p) / (∂w_ij ∂w_uv) ]   for all u, v.   (3.2)

It can be shown that by using Pearlmutter's and Møller's efficient second-order method, the gradient of B(w, x_p) can be computed in O(L) time. Therefore, our algorithm has the same order of computational complexity as standard backpropagation.
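For concreteness, the following sketch evaluates B(w, x_p) of equation 3.1 directly from the Jacobian of the outputs with respect to the weights. The random Jacobian, the toy dimensions, and all names are our own illustrative assumptions; in the actual algorithm the derivatives ∂o^k(w, x_p)/∂w_ij would be delivered by backpropagation.

```python
import numpy as np

def fms_regularizer(J, eps=0.01):
    """Compute B(w, x_p) of equation 3.1 from an output Jacobian.

    J   -- array of shape (K, L): J[k, l] ~ do^k(w, x_p)/dw_l, with the
           L weights flattened into a single index l.
    eps -- the constant epsilon (its term is constant in w, so it does
           not affect the gradient used for learning).
    """
    K, L = J.shape
    col_norm_sq = (J ** 2).sum(axis=0)        # sum_k (do^k/dw_l)^2, shape (L,)
    term1 = -L * np.log(eps)
    term2 = 0.5 * np.log(col_norm_sq).sum()   # first sum of eq. 3.1
    # inner[k] = sum_l |do^k/dw_l| / sqrt(sum_k (do^k/dw_l)^2)
    inner = (np.abs(J) / np.sqrt(col_norm_sq)).sum(axis=1)
    term3 = L * np.log((inner ** 2).sum())    # second sum of eq. 3.1
    return term1 + term2 + term3

# Toy stand-in for a backpropagated Jacobian (K = 3 outputs, L = 10 weights).
rng = np.random.default_rng(0)
print(fms_regularizer(rng.normal(size=(3, 10))))
```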
4 Derivation of the Algorithm

4.1 Outline. We are interested in weights representing nets with tolerable error but flat outputs (see Sections 2 and A.1). To find nets with flat outputs, two conditions will be defined to specify B(w, x_p) for x_p ∈ X0 and, as a consequence, B(w, X0) (see Section 3). The first condition ensures flatness. The second condition enforces equal flatness in all weight space directions, to obtain low variance of the net functions induced by weights within a box. The second condition will be justified using an MDL-based argument. In both cases, linear approximations will be made (to be justified in Section A.2).

4.2 Formal Details. We are interested in weights causing tolerable error (see "acceptable minima" in Section 2) that can be perturbed without
causing significant output changes, thus indicating the presence of many neighboring weights leading to the same net function. By searching for the boxes from Section 2, we are actually searching for low-error weights whose perturbation does not significantly change the net function. In what follows we treat the input x_p as fixed. For convenience, we suppress x_p, abbreviating o^k(w, x_p) by o^k(w). Perturbing the weights w by δw (with components δw_ij), we obtain

    ED(w, δw) := Σ_k (o^k(w + δw) − o^k(w))²,

where o^k(w) expresses o^k's dependence on w (in what follows, however, w often will be suppressed for convenience; we abbreviate o^k(w) by o^k). Linear approximation (justified in Section A.2) gives us flatness condition 1:

    ED(w, δw) ≈ ED_l(δw) := Σ_k ( Σ_{i,j} (∂o^k/∂w_ij) δw_ij )² ≤ ED_{l,max}(δw) := Σ_k ( Σ_{i,j} |∂o^k/∂w_ij| |δw_ij| )² ≤ ε,   (4.1)
where ε > 0 defines tolerable output changes within a box and is small enough to allow for linear approximation (it does not appear in B(w, x_p)'s and B(w, D0)'s gradients; see Section 3). ED_l is ED's linear approximation, and ED_{l,max} is max{ED_l(w, δv) | ∀ i, j: δv_ij = ±δw_ij}. Flatness condition 1 is a "robustness condition" (or "fault tolerance condition" or "perturbation tolerance condition"; see, e.g., Minai and Williams 1994; Murray and Edwards 1993; Neti et al. 1992; Matsuoka 1992; Bishop 1993; Kerlirzin and Vallet 1993; Carter et al. 1990). Many boxes M_w satisfy flatness condition 1. To select a particular, very flat M_w, the following flatness condition 2 uses up degrees of freedom left by inequality 4.1:
    ∀ i, j, u, v:   (δw_ij)² Σ_k (∂o^k/∂w_ij)² = (δw_uv)² Σ_k (∂o^k/∂w_uv)².   (4.2)
Flatness condition 2 enforces equal "directed errors"

    ED_ij(w, δw_ij) = Σ_k (o^k(w_ij + δw_ij) − o^k(w_ij))² ≈ Σ_k ( (∂o^k/∂w_ij) δw_ij )²,
where o^k(w_ij) has the obvious meaning, and δw_ij is the (i, j)th component of δw. Linear approximation is justified by the choice of ε in inequality 4.1. As will be seen in the MDL justification presented below, flatness condition 2 favors the box that minimizes the mean perturbation error within the box. This corresponds to minimizing the variance of the net functions induced by weights within the box (recall that ED(w, δw) is quadratic).
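As a quick sanity check on the linear approximation behind these conditions, the sketch below perturbs the weights of a tiny two-layer net and compares ED(w, δw) with its linearization ED_l(δw), built here from a finite-difference Jacobian. The network shape, the perturbation scale, and all names are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(2, 3))  # tiny 2-3-2 net
x = rng.normal(size=2)

def output(w_flat):
    """Net output o(w, x) for a flattened weight vector w."""
    A = w_flat[:6].reshape(3, 2)
    B = w_flat[6:].reshape(2, 3)
    return np.tanh(B @ np.tanh(A @ x))

w = np.concatenate([W1.ravel(), W2.ravel()])
L = w.size

# Finite-difference Jacobian J[k, l] ~ do^k/dw_l.
h = 1e-6
J = np.stack([(output(w + h * e) - output(w - h * e)) / (2 * h)
              for e in np.eye(L)], axis=1)

dw = 1e-3 * rng.normal(size=L)                  # small perturbation
ED = ((output(w + dw) - output(w)) ** 2).sum()  # exact output change
EDl = ((J @ dw) ** 2).sum()                     # linear approximation (eq. 4.1)
print(ED, EDl)  # nearly equal for small perturbations
```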
4.3 Deriving the Algorithm from Flatness Conditions 1 and 2. We first solve equation 4.2 for |δw_ij|:

    |δw_ij| = |δw_uv| sqrt( Σ_k (∂o^k/∂w_uv)² / Σ_k (∂o^k/∂w_ij)² )   (fixing u, v for all i, j).

Then we insert the |δw_ij| (with fixed u, v) into inequality 4.1 (replacing the second "≤" in 4.1 with "=", because we search for the box with maximal volume). This gives us an equation for the |δw_uv| (which depend on w, but this is notationally suppressed):

    |δw_uv| = sqrt(ε) / ( sqrt( Σ_k (∂o^k/∂w_uv)² ) · sqrt( Σ_k ( Σ_{i,j} |∂o^k/∂w_ij| / sqrt( Σ_k (∂o^k/∂w_ij)² ) )² ) ).   (4.3)

The |δw_ij| (u, v replaced by i, j) approximate the Δw_ij from Section 2. The box M_w is approximated by AM_w, the box with center w and edge lengths 2δw_ij. M_w's volume V(Δw) is approximated by AM_w's box volume Ṽ(δw) := 2^L Π_{i,j} |δw_ij|. Thus, B̃(w, x_p) (see Section 3) can be approximated by

    B(w, x_p) := −log( (1/2^L) Ṽ(δw) ) = Σ_{i,j} −log |δw_ij|.

This immediately leads to the algorithm given by equation 3.1.
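Equation 4.3 in sketch form, again assuming the output Jacobian is available: the resulting |δw| are the approximate box edge half-lengths, and (as described in Section 5.6) weights whose δw is orders of magnitude larger than the rest are candidates for pruning. The separation threshold used below is an illustrative assumption.

```python
import numpy as np

def box_edges(J, eps=0.01):
    """Approximate box half-lengths |dw_l| per equation 4.3.

    J -- array of shape (K, L) with J[k, l] ~ do^k/dw_l.
    """
    col_norm = np.sqrt((J ** 2).sum(axis=0))   # sqrt(sum_k (do^k/dw_l)^2)
    inner = (np.abs(J) / col_norm).sum(axis=1) # one value per output k
    scale = np.sqrt((inner ** 2).sum())        # common factor of eq. 4.3
    return np.sqrt(eps) / (col_norm * scale)   # shape (L,)

rng = np.random.default_rng(2)
J = rng.normal(size=(3, 10))
J[:, 0] *= 1e-4                       # weight 0 barely affects the outputs ...
dw = box_edges(J)
print(dw / dw.min())                  # ... so its delta is orders of magnitude larger
prunable = dw > 100 * np.median(dw)   # illustrative separation threshold
print(np.flatnonzero(prunable))
```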
4.4 How Can the Above Approximations Be Justified? The learning process itself enforces their validity (see Section A.2). Initially, the conditions above are valid only in a very small environment of an "initial" acceptable minimum. But during the search for new acceptable minima with more associated box volume, the corresponding environments are enlarged. Section A.2 will prove this for feedforward nets (experiments indicate that it holds for recurrent nets as well).

4.5 Comments. Flatness condition 2 influences the algorithm as follows: (1) the algorithm prefers to increase the δw_ij of weights whose current contributions are not important for computing the target output; (2) the algorithm enforces equal sensitivity of all output units with respect to weights of connections to hidden units. Hence, output units tend to share hidden units; that is, different hidden units tend to contribute equally to the computation of the target. The contributions of a particular hidden unit to different output unit activations tend to be equal, too.

Flatness condition 2 is essential. Flatness condition 1 by itself corresponds to nothing more than first-order derivative reduction (ordinary sensitivity
reduction). However, what we really want is to minimize the variance of the net functions induced by weights near the actual weight vector. Automatically, the algorithm treats units and weights in different layers differently, and takes the nature of the activation functions into account.

4.6 MDL Justification of Flatness Condition 2. Let us assume a sender wants to send a description of the function induced by w to a receiver who knows the inputs x_p but not the targets y_p, where (x_p, y_p) ∈ D0. The MDL principle suggests that the sender wants to minimize the expected description length of the net function. Let ED_mean(w, X0) denote the mean value of ED on the box. Expected description length is approximated by µED_mean(w, X0) + B̃(w, X0) + c, where c, µ are positive constants. One way of seeing this is to apply Hinton and van Camp's "bits back" argument to a uniform weight prior (ED_mean corresponds to the output variance). However, we prefer to use a different argument: we encode each weight w_ij of the box center w by a bitstring according to the following procedure (Δw_ij is given):

(0) Define a variable interval I_ij ⊂ R.
(1) Make I_ij equal to the interval constraining possible weight values.
(2) While I_ij ⊄ [w_ij − Δw_ij, w_ij + Δw_ij]: divide I_ij into two equally sized disjoint intervals I1 and I2. If w_ij ∈ I1, then I_ij ← I1; write '1'. If w_ij ∈ I2, then I_ij ← I2; write '0'.

The final set {I_ij} corresponds to a bit box within our box. This bit box contains M_w's center w and is described by a bitstring of length B̃(w, X0) + c, where the constant c is independent of the box M_w.
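The bisection procedure above is easy to state in code. The sketch below encodes one weight w_ij to precision Δw_ij, writing one bit per halving; the initial weight range matches the [−30, 30] constraint of Section 5.6, the choice of which half counts as I1 is ours, and the function name is hypothetical.

```python
def encode_weight(w, dw, lo=-30.0, hi=30.0):
    """Bisection code for one box-center weight w with half-edge dw.

    Halve the interval [lo, hi] until it fits inside [w - dw, w + dw],
    emitting '1' when w falls in the lower half (our I1) and '0' for
    the upper half (our I2).
    """
    bits = []
    while not (lo >= w - dw and hi <= w + dw):
        mid = (lo + hi) / 2.0
        if w < mid:           # w lies in the lower half-interval I1
            hi = mid
            bits.append('1')
        else:                 # w lies in the upper half-interval I2
            lo = mid
            bits.append('0')
    return ''.join(bits)

# A flat weight (large dw) needs far fewer bits than a sharp one:
print(len(encode_weight(3.7, 1.0)))    # low precision, short code
print(len(encode_weight(3.7, 0.001)))  # high precision, long code
```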
From ED(w, w_b − w) (w_b is the center of the bit box) and the bitstring describing the bit box, the receiver can compute w by selecting an initialization weight vector within the bit box and using gradient descent to decrease B(w_a, X0) until ED(w_a, w_b − w_a) = ED(w, w_b − w), where w_a in the bit box denotes the receiver's current approximation of w (w_a is constantly updated by the receiver). This is like "FMS without targets." Recall that the receiver knows the inputs x_p. Since w corresponds to the weight vector with the highest degree of local flatness within the bit box, the receiver will find the correct w. ED(w, w_b − w) is described by a gaussian distribution with mean zero. Hence, the description length of ED(w, w_b − w) is µED(w, w_b − w) (Shannon 1948). w_b, the center of the bit box, cannot be known before training. However, we do know the expected description length of the net function, which is µED_mean + B̃(w, X0) + c (c is a constant independent of w). Let us approximate ED_mean:

    ED_{l,mean}(w, δw) := (1/V(δw)) ∫_{AM_w} ED_l(w, δv) dδv
        = (1/V(δw)) Σ_{i,j} ( (2^L / 3) (δw_ij)³ Σ_k (∂o^k/∂w_ij)² Π_{(u,v) ≠ (i,j)} δw_uv )
        = (1/3) Σ_{i,j} (δw_ij)² Σ_k (∂o^k/∂w_ij)².

Among those w that lead to equal B(w, X0) (the negative logarithm of the box volume plus L log 2), we want to find those with minimal description length of the function induced by w. Using Lagrange multipliers (viewing the δw_ij as variables), it can be shown that ED_{l,mean} is minimal under the condition B(w, X0) = constant iff flatness condition 2 holds. To conclude: with given box volume, we need flatness condition 2 to minimize the expected description length of the function induced by w.

5 Experimental Results

5.1 Experiment 1: Noisy Classification.

5.1.1 Task. The first task is taken from Pearlmutter and Rosenfeld (1991). The task is to decide whether the x-coordinate of a point in two-dimensional space exceeds zero (class 1) or does not (class 2). Noisy training and test examples are generated as follows: data points are obtained from a gaussian with zero mean and standard deviation 1.0, bounded in the interval [−3.0, 3.0]. The data points are misclassified with probability 0.05. Final input data are obtained by adding a zero-mean gaussian with standard deviation 0.15 to the data points. In a test with 2 million data points, it was found that the procedure above leads to 9.27% misclassified data. No method will misclassify less than 9.27%, due to the inherent noise in the data (including the test data). The training set is based on 200 fixed data points (see Fig. 3). The test set is based on 120,000 data points.

5.1.2 Results. Ten conventional backprop (BP) nets were tested against ten equally initialized networks trained by flat minimum search (FMS). After 1000 epochs, the weights of our nets essentially stopped changing
Figure 3: The 200 input examples of the training set. Crosses represent data points from class 1. Squares represent data points from class 2.
(automatic "early stopping"), while BP kept changing weights to learn the outliers in the data set and overfit. In the end, our approach left a single hidden unit h with a maximal weight of 30.0 or −30.0 from the x-axis input. Unlike with BP, the other hidden units were effectively pruned away (outputs near zero). So was the y-axis input (zero weight to h). It can be shown that this corresponds to an "optimal" net with minimal numbers of units and weights. Table 1 illustrates the superior performance of our approach.

Parameters: Learning rate: 0.1. Architecture: (2-20-1). Number of training epochs: 400,000. With FMS: Etol = 0.0001. See Section 5.6 for parameters common to all experiments.
Table 1: Ten Comparisons of Conventional Backpropagation (BP) and Flat Minimum Search (FMS).

Trial   BP MSE   BP dto   FMS MSE   FMS dto
 1      0.220    1.35     0.193     0.00
 2      0.223    1.16     0.189     0.09
 3      0.222    1.37     0.186     0.13
 4      0.213    1.18     0.181     0.01
 5      0.222    1.24     0.195     0.25
 6      0.219    1.24     0.187     0.04
 7      0.215    1.14     0.187     0.07
 8      0.214    1.10     0.185     0.01
 9      0.218    1.21     0.190     0.09
10      0.214    1.21     0.188     0.07

Note: The MSE columns show mean squared error on the test set. The dto columns show the difference between the percentage of misclassifications and the optimal percentage (9.27). FMS clearly outperforms backpropagation.
5.2 Experiment 2: Recurrent Nets.

5.2.1 Time-Varying Inputs. The method works for continually running fully recurrent nets as well. At every time step, a recurrent net with sigmoid activations in [0, 1] sees an input vector from a stream of randomly chosen input vectors from the set {(0, 0), (0, 1), (1, 0), (1, 1)}. The task is to switch on the first output unit whenever an input (1, 0) had occurred two time steps ago and to switch on the second output unit without delay in response to any input (0, 1). The task can be solved by a single hidden unit.

5.2.2 Non-Weight-Decay-Like Results. With conventional recurrent net algorithms, after training, both hidden units were used to store the input vector. This was not so with our new approach. We trained 20 networks. All of them learned perfect solutions. As with weight decay, most weights to the output decayed to zero. But unlike with weight decay, strong inhibitory connections (−30.0) switched off one of the hidden units, effectively pruning it away.

Parameters: Learning rate: 0.1. Architecture: (2-2-2). Number of training examples: 1,500. Etol = 0.0001. See Section 5.6 for parameters common to all experiments.
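For readers who want to reproduce the task, a minimal generator for the input stream and the two target units might look as follows; encoding the targets as 0/1 activations is our assumption, since the text only states when each unit should switch on.

```python
import numpy as np

def make_stream(T, seed=0):
    """Random input stream and targets for the recurrent-net task.

    Target unit 0 fires iff the input two steps ago was (1, 0);
    target unit 1 fires immediately iff the current input is (0, 1).
    """
    rng = np.random.default_rng(seed)
    vocab = np.array([(0, 0), (0, 1), (1, 0), (1, 1)])
    x = vocab[rng.integers(0, 4, size=T)]             # shape (T, 2)
    y = np.zeros((T, 2))
    for t in range(T):
        if t >= 2 and (x[t - 2] == (1, 0)).all():
            y[t, 0] = 1.0
        if (x[t] == (0, 1)).all():
            y[t, 1] = 1.0
    return x, y

x, y = make_stream(10)
print(np.hstack([x, y]))
```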
5.3 Experiment 3: Stock Market Prediction 1.

5.3.1 Task. We predict the DAX¹ (the German stock market index) using fundamental indicators. Following Rehkugler and Poddig (1990), the net sees the following indicators: (1) German interest rate (Umlaufsrendite), (2) industrial production divided by money supply, and (3) business sentiments (IFO Geschäftsklimaindex). The input (scaled in the interval [−3.4, 3.4]) is the difference between data from the current quarter and last year's corresponding quarter. The goal is to predict the sign of next year's corresponding DAX difference.

5.3.2 Details. The training set consists of 24 data vectors from 1966 to 1972. Positive DAX tendency is mapped to target 0.8; otherwise the target is −0.8. The test set consists of 68 data vectors from 1973 to 1990. FMS is compared against (1) conventional backpropagation with 8 hidden units (BP8), (2) BP with 4 hidden units (BP4) (4 hidden units are chosen because pruning methods favor 4 hidden units, but 3 is not enough), (3) optimal brain surgeon (OBS; Hassibi and Stork 1993), with a few improvements (see Section 5.6), and (4) weight decay (WD) according to Weigend et al. (1991) (WD and OBS were chosen because they are well known and widely used).

5.3.3 Performance Measure. Since wrong predictions lead to loss of money, performance is measured as follows: the sum of incorrectly predicted DAX changes is subtracted from the sum of correctly predicted DAX changes, and the result is divided by the sum of absolute DAX changes.

5.3.4 Results. Table 2 shows the results. Our method outperforms the other methods. Note that MSE is not a reasonable performance measure for this task. For instance, although FMS typically makes more correct classifications than WD, FMS's MSE often exceeds WD's. This is because WD's wrong classifications tend to be close to 0, while FMS often prefers large weights yielding strong output activations, so FMS's few false classifications tend to contribute a lot to MSE.

Parameters: Learning rate: 0.01. Architecture: (3-8-1), except BP4 with (3-4-1). Number of training examples: 20 million.
¹ Raw DAX version according to Statistisches Bundesamt (Federal Office of Statistics). Other data are from the same source (except for business sentiment). Collected by Christian Puritscher, for a diploma thesis in industrial management at LMU, Munich.
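A sketch of the performance measure of Section 5.3.3, under our reading that a prediction counts as "correct" when its sign matches the actual DAX change; scaling by 100 to match the table entries is also our assumption.

```python
import numpy as np

def dax_performance(predicted, actual):
    """(sum |actual| over correct signs - sum over wrong signs) / sum |actual|."""
    predicted, actual = np.asarray(predicted), np.asarray(actual)
    correct = np.sign(predicted) == np.sign(actual)
    gained = np.abs(actual)[correct].sum()
    lost = np.abs(actual)[~correct].sum()
    return 100.0 * (gained - lost) / np.abs(actual).sum()

# Perfect sign prediction scores 100; random guessing scores about 0.
print(dax_performance([0.3, -0.1, 0.2], [5.0, -2.0, -1.0]))  # (5+2-1)/8 -> 75.0
```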
Table 2: Comparisons of Conventional Backpropagation (BP4, BP8), Optimal Brain Surgeon (OBS), Weight Decay (WD), and Flat Minimum Search (FMS).

                                Removed      Performance
Method   Train MSE   Test MSE   w    u    Max     Min     Mean
BP8      0.003       0.945      -    -    47.33   25.74   37.76
BP4      0.043       1.066      -    -    42.02   42.02   42.02
OBS      0.089       1.088      14   3    48.89   27.17   41.73
WD       0.096       1.102      22   4    44.47   36.47   43.49
FMS      0.040       1.162      24   4    47.74   39.70   43.62

Note: All nets except BP4 start out with eight hidden units. Each value is a mean of seven trials. The MSE columns show mean squared error. Column w shows the number of pruned weights; column u shows the number of pruned units. The final three columns (Max, Min, Mean) list maximal, minimal, and mean performance (see text) over seven trials. Note that test MSE is insignificant for performance evaluations (this is due to targets 0.8/−0.8, as opposed to the "real" DAX targets). Our method outperforms all other methods.
Method-specific parameters: FMS: Etol = 0.13; Δλ = 0.001. WD: like FMS, but w0 = 0.2. OBS: Etol = 0.015 (the same result was obtained with higher Etol values, e.g., 0.13). See Section 5.6 for parameters common to all experiments.

5.4 Experiment 4: Stock Market Prediction 2.

5.4.1 Task. We predict the DAX again, using the basic setup of the experiment in Section 5.3, with the following modifications:

• There are two additional inputs: dividend rate and foreign orders in manufacturing industry.
• Monthly predictions are made. The net input is the difference between the current month's data and last month's data. The goal is to predict the sign of next month's corresponding DAX difference.
• There are 228 training examples and 100 test examples.
• The target is the percentage of DAX change scaled in the interval [−1, 1] (outliers are ignored).
• Performance of WD and FMS is also tested on networks "spoiled" by conventional BP ("WDR" and "FMSR"; the "R" stands for retraining).
Table 3: Comparisons of Conventional Backpropagation (BP), Optimal Brain Surgeon (OBS), Weight Decay after Spoiling the Net with BP (WDR), Flat Minimum Search after Spoiling the Net with BP (FMSR), Weight Decay (WD), and Flat Minimum Search (FMS).

                                Removed      Performance
Method   Train MSE   Test MSE   w    u    Max     Min     Mean
BP       0.181       0.535      -    -    57.33   20.69   41.61
OBS      0.219       0.502      15   1    50.78   32.20   40.43
WDR      0.180       0.538      0    0    62.54   13.64   41.17
FMSR     0.180       0.542      0    0    64.07   24.58   41.57
WD       0.235       0.452      17   3    54.04   32.03   40.75
FMS      0.240       0.472      19   3    54.11   31.12   44.40

Note: All nets start out with eight hidden units. Each value is a mean of ten trials. The MSE columns show mean squared error. Column w shows the number of pruned weights; column u shows the number of pruned units. The final three columns (Max, Min, Mean) list maximal, minimal, and mean performance (see text) over ten trials (note again that MSE is an irrelevant performance measure for this task). Flat minimum search outperforms all other methods.
5.4.2 Results. Table 3 shows the results. The average performance of our method exceeds those of weight decay, OBS, and conventional BP. Table 3 also shows the superior performance of our approach when it comes to retraining "spoiled" networks (note that OBS is a retraining method by nature). FMS led to the best improvements in generalization performance.

Parameters: Learning rate: 0.01. Architecture: (5-8-1). Number of training examples: 20,000,000.

Method-specific parameters: FMS: Etol = 0.235; Δλ = 0.0001; if Eaverage < Etol, then Δλ is set to 0.001. WD: like FMS, but w0 = 0.2. FMSR: like FMS, but Etol = 0.15; number of retraining examples: 5,000,000. WDR: like FMSR, but w0 = 0.2. OBS: Etol = 0.235. See Section 5.6 for parameters common to all experiments.
5.5 Experiment 5: Stock Market Prediction 3.

5.5.1 Task. This time, we predict the DAX using weekly technical (as opposed to fundamental) indicators. The data (DAX values and 35 technical indicators) were provided by Bayerische Vereinsbank.

5.5.2 Data Analysis. To analyze the data, we computed (1) the pairwise correlation coefficients of the 35 technical indicators and (2) the maximal pairwise correlation coefficients of all indicators and all linear combinations of two indicators. This analysis revealed that only four indicators are not highly correlated. For such reasons, our nets see only the eight most recent DAX changes and the following technical indicators: (a) the DAX value, (b) change of the 24-week relative strength index (RSI), the relation of increasing tendency to decreasing tendency, (c) 5-week statistic, and (d) MACD (smoothed difference of exponentially weighted 6-week and 24-week DAX).

5.5.3 Input Data. The final network input is obtained by scaling the values (a-d) and the eight most recent DAX changes in [−2, 2]. The training set consists of 320 data points (July 1985-August 1991). The targets are the actual DAX changes scaled in [−1, 1].

5.5.4 Comparison. The following methods are applied to the training set: (1) conventional BP, (2) optimal brain surgeon/optimal brain damage (OBS/OBD), (3) weight decay (WD) according to Weigend et al. (1991), and (4) flat minimum search (FMS). The resulting nets are evaluated on a test set consisting of 100 data points (August 1991-July 1993). Performance is measured as in Section 5.3.

5.5.5 Results. Table 4 shows the results. Again, our method outperforms the other methods.

Parameters: Learning rate: 0.01. Architecture: (12-9-1). Training time: 10 million examples.

Method-specific parameters: OBS/OBD: Etol = 0.34. FMS: Etol = 0.34; Δλ = 0.003; if Eaverage < Etol, then Δλ is set to 0.03. WD: like FMS, but w0 = 0.2. See Section 5.6 for parameters common to all experiments.
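As an illustration of indicator (d), a standard MACD-style computation on a weekly price series might look as follows. The exact smoothing constants behind the bank's data are unknown to us, so the spans below are conventional placeholder choices.

```python
import numpy as np

def ema(series, span):
    """Exponentially weighted moving average with smoothing 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = np.empty_like(series, dtype=float)
    out[0] = series[0]
    for t in range(1, len(series)):
        out[t] = alpha * series[t] + (1 - alpha) * out[t - 1]
    return out

def macd(prices, fast=6, slow=24):
    """Smoothed difference of fast and slow exponentially weighted averages."""
    return ema(prices, fast) - ema(prices, slow)

weekly_dax = 100 + np.cumsum(np.random.default_rng(3).normal(size=200))
print(macd(weekly_dax)[-5:])
```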
Table 4: Comparisons of Conventional Backpropagation (BP), Optimal Brain Surgeon (OBS), Weight Decay (WD), and Flat Minimum Search (FMS).

                                Removed       Performance
Method   Train MSE   Test MSE   w     u    Max     Min      Mean
BP       0.13        1.08       -     -    28.45   −16.70    8.08
OBS      0.38        0.912      55    1    27.37    −6.08   10.70
WD       0.51        0.334      110   8    26.84    −6.88   12.97
FMS      0.46        0.348      103   7    29.72    18.09   21.26

Note: All nets start out with nine hidden units. Each value is a mean of ten trials. The MSE columns show mean squared error. Column w shows the number of pruned weights; column u shows the number of pruned units. The final three columns (Max, Min, Mean) list maximal, minimal, and mean performance (see text) over ten trials (note again that MSE is an irrelevant performance measure for this task). Flat minimum search outperforms all other methods.
5.6 Details and Parameters. With the exception of the experiment in Section 5.2, all units are sigmoid in the range [−1.0, 1.0]. Weights are constrained to [−30, 30] and initialized in [−0.1, 0.1]. The latter ensures high first-order derivatives in the beginning of the learning phase. WD is set up to barely punish weights below w0 = 0.2. Eaverage is the average error on the training set, approximated using exponential decay: Eaverage ← γEaverage + (1 − γ)E(net(w), D0), where γ = 0.85.

5.6.1 FMS Details. To control B(w, D0)'s influence during learning, its gradient is normalized and multiplied by the length of E(net(w), D0)'s gradient (same for weight decay; see below). λ is computed as in Weigend et al. (1991) and initialized with 0. Absolute values of first-order derivatives are replaced by 10^−20 if below this value. We ought to judge a weight w_ij as being pruned if δw_ij (see equation 4.3) exceeds the length of the weight range. However, the unknown scaling factor ε (see inequality 4.1 and equation 4.3) is required to compute δw_ij. Therefore, we judge a weight w_ij as being pruned if, with arbitrary ε, δw_ij is much bigger than the corresponding δ's of the other weights (typically, there are clearly separable classes of weights with high and low δ's, which differ from each other by a factor ranging from 10² to 10⁵). If all weights to and from a particular unit are very close to zero, the unit is lost: due to tiny derivatives, the weights will never again increase significantly. Sometimes it is necessary to bring lost units back into the game. For this purpose, every n_init time steps (typically, n_init = 500,000), all weights w_ij with 0 ≤ w_ij < 0.01 are randomly reinitialized in [0.005, 0.01], all weights w_ij with 0 ≥ w_ij > −0.01 are randomly reinitialized in [−0.01, −0.005], and λ is set to 0.
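The gradient bookkeeping of Section 5.6.1 in sketch form: the regularizer gradient is rescaled to the length of the error gradient before the weighted sum is applied, and the running error average uses the stated exponential decay. The simple additive λ schedule is our illustrative reading of the Weigend et al. (1991) heuristic, not a verbatim reproduction.

```python
import numpy as np

def fms_update(w, grad_E, grad_B, lam, lr=0.01, eps=1e-20):
    """One FMS gradient step: E's gradient plus a normalized B gradient.

    grad_B is normalized and scaled to the length of grad_E, so that
    lambda alone controls the trade-off between the two terms.
    """
    scaled_B = grad_B * (np.linalg.norm(grad_E) / max(np.linalg.norm(grad_B), eps))
    return w - lr * (grad_E + lam * scaled_B)

def update_lambda(lam, E_average, E_tol, delta=0.001):
    """Illustrative lambda schedule: strengthen the regularizer while the
    running training error is below E_tol, weaken it otherwise."""
    return lam + delta if E_average < E_tol else max(lam - delta, 0.0)

E_average, gamma = 1.0, 0.85
E_average = gamma * E_average + (1 - gamma) * 0.4   # exponential decay update
w = fms_update(np.zeros(3), np.array([0.1, 0.2, 0.3]), np.ones(3), lam=0.5)
print(E_average, w)
```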
5.6.2 Weight Decay Details. We used Weigend et al.'s weight decay term:

    D(w) = Σ_{i,j} (w_ij²/w0²) / (1 + w_ij²/w0²).

As with FMS, D(w, w0)'s gradient was normalized and multiplied by the length of E(net(w), D0)'s gradient. λ was adjusted as with FMS. Lost units were brought back as with FMS.

5.6.3 Modifications of OBS. Typically most weights exceed 1.0 after training. Therefore, higher-order terms of δw in the Taylor expansion of the error function do not vanish, and OBS is not fully theoretically justified. Still, we used OBS to delete high weights, assuming that higher-order derivatives are small if second-order derivatives are. To obtain reasonable performance, we modified the original OBS procedure (notation following Hassibi and Stork 1993):

• To detect the weight that deserves deletion, we use both

    L_q = w_q² / (2 [H⁻¹]_qq)

(the original value used by Hassibi and Stork) and

    T_q := (∂E/∂w_q) w_q + (1/2) (∂²E/∂w_q²) w_q².

Here H denotes the Hessian and H⁻¹ its approximate inverse.
We delete the weight causing minimal training set error (after tentative deletion).

• As with OBD (LeCun et al. 1990), to prevent numerical errors due to small eigenvalues of H: if L_q < 0.00001 or T_q < 0.00001 or ||I − H⁻¹H|| > 10.0 (bad approximation of H⁻¹), we delete only the weight detected in the previous step; the other weights remain the same. Here || · || denotes the sum of the absolute values of all components of a matrix.

• If OBS's adjustment of the remaining weights leads to at least one absolute weight change exceeding 5.0, then δw is scaled such that the maximal absolute weight change is 5.0. This leads to better performance (also due to small eigenvalues).

• If Eaverage > Etol after weight deletion, then the net is retrained until either Eaverage < Etol or the number of training examples exceeds 800,000. Practical experience indicates that the choice of Etol barely influences the result.

• OBS is stopped if Eaverage > Etol after retraining. The most recent weight deletion is countermanded.
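A sketch of the two deletion criteria, given a weight vector, an error gradient, and an (approximate) inverse Hessian. The names and the surrounding plumbing are ours; in practice these quantities would come from the network's training procedure rather than from the random stand-ins below.

```python
import numpy as np

def obs_saliencies(w, grad, H_inv, H_diag):
    """Both weight-deletion criteria from the modified OBS procedure.

    L[q] = w_q^2 / (2 [H^-1]_qq)                   (Hassibi and Stork)
    T[q] = (dE/dw_q) w_q + 0.5 (d2E/dw_q2) w_q^2   (Taylor-based check)
    """
    L = w ** 2 / (2.0 * np.diag(H_inv))
    T = grad * w + 0.5 * H_diag * w ** 2
    return L, T

rng = np.random.default_rng(4)
n = 6
A = rng.normal(size=(n, n))
H = A @ A.T + n * np.eye(n)        # a well-conditioned stand-in Hessian
H_inv, H_diag = np.linalg.inv(H), np.diag(H)
w, grad = rng.normal(size=n), 0.01 * rng.normal(size=n)
L, T = obs_saliencies(w, grad, H_inv, H_diag)
print(np.argmin(L), np.argmin(T))  # candidate weights for deletion
```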
6 Relation to Previous Work

Most previous algorithms for finding low-complexity networks with high generalization capability are based on different prior assumptions. They can be broadly classified into two categories (see Schmidhuber 1994a for an exception):

1. Assumptions about the prior weight distribution. Hinton and van Camp (1993) and Williams (1994) assume that pushing the posterior weight distribution close to the weight prior leads to "good" generalization (more details below). Weight decay (e.g., Hanson and Pratt 1989; Krogh and Hertz 1992) can be derived, for example, from gaussian or Laplace weight priors. Nowlan and Hinton (1992) assume that a distribution of networks with many similar weights generated by gaussian mixtures is "better" a priori. MacKay's weight priors (1992b) are implicit in additional penalty terms, which embody the assumptions made. The problem with these approaches is that there may not be a "good" weight prior for all possible architectures and training sets. With FMS, however, we do not have to select a "good" weight prior but instead choose a prior over input-output functions. This automatically takes the net architecture and the training set into account.

2. Prior assumptions about how theoretical results on early stopping and network complexity carry over to practical applications. Such assumptions are implicit in methods based on validation sets (Mosteller and Tukey 1968; Stone 1974; Eubank 1988; Hastie and Tibshirani 1990), for example, generalized cross-validation (Craven and Wahba 1979; Golub et al. 1979), final prediction error (Akaike 1970), and generalized prediction error (Moody and Utans 1992; Moody 1992). See also Holden (1994), Wang et al. (1994), Amari and Murata (1993), and Vapnik's structural risk minimization (Guyon et al. 1992; Vapnik 1992).

6.1 Constructive Algorithms and Pruning Algorithms. Other architecture selection methods are less flexible in the sense that they can be used only before or after weight adjustments. Examples are sequential network construction (Fahlman and Lebiere 1990; Ash 1989; Moody 1989), input pruning (Moody 1992; Refenes et al. 1994), unit pruning (White 1989; Mozer and Smolensky 1989; Levin et al. 1994), and weight pruning, for example, optimal brain damage (LeCun et al. 1990) and optimal brain surgeon (Hassibi and Stork 1993).

6.2 Hinton and van Camp (1993). They minimize the sum of two terms: the first is conventional error plus variance; the other is the distance

    ∫ p(w | D0) log( p(w | D0) / p(w) ) dw

between posterior p(w | D0) and weight prior p(w). They
have to choose a "good" weight prior. But perhaps there is no "good" weight prior for all possible architectures and training sets. With FMS, however, we do not depend on a "good" weight prior; instead we have a prior over input-output functions, thus taking into account net architecture and training set. Furthermore, Hinton and van Camp have to compute variances of weights and unit activations, which (in general) cannot be done using linear approximation. Intuitively, their weight variances are related to our Δw_ij. Our approach, however, does justify linear approximation, as seen in Section A.2.

6.3 Wolpert (1994a). His (purely theoretical) analysis suggests an interesting, different additional error term (taking into account local flatness in all directions): the logarithm of the Jacobi determinant of the functional from weight space to the space of possible nets. This term is small if the net output (based on the current weight vector) is locally flat in weight space (if many neighboring weights lead to the same net function in the space of possible net functions). It is not clear, however, how to derive a practical algorithm (e.g., a pruning algorithm) from this.

6.4 Murray and Edwards (1993). They obtain additional error terms consisting of weight squares and second-order derivatives. Unlike our approach, theirs explicitly prefers weights near zero. In addition, their approach appears to require much more computation time (due to second-order derivatives in the error term).

7 Limitations, Final Remarks, and Future Research

7.1 How to Adjust λ. Given recent trends in neural computing (see, e.g., MacKay 1992a, 1992b), it may seem like a step backward that λ is adapted using an ad hoc heuristic from Weigend et al. (1991). However, for determining λ in MacKay's style, one would have to compute the Hessian of the cost function. Since our term B(w, X0) includes first-order derivatives, adjusting λ would require the computation of third-order derivatives. This is impracticable. Also, to optimize the regularizing parameter λ (see MacKay 1992b), we need to compute the integral ∫ d^L w exp(−λB(w, X0)), but it is not obvious how; the "quick and dirty version" (MacKay 1992a) cannot deal with the unknown constant ε in B(w, X0). Future work will investigate how to adjust λ without too much computational effort. In fact, as will be seen in Section A.1, the choices of λ and Etol are correlated. The optimal choice of Etol may indeed correspond to the optimal choice of λ.

7.2 Generalized Boxes. The boxes found by the current version of FMS are axis aligned. This may cause an underestimate of flat minimum volume. Although our experiments indicate that box search works very well, it
will be interesting to compare alternative approximations of flat minimum volumes.

7.3 Multiple Initializations. First, consider this FMS alternative: run conventional BP starting with several random initial guesses and pick the flattest minimum with the largest volume. This does not work: conventional BP changes the weights according to steepest descent; it runs away from flat ranges in weight space. Using an "FMS committee" (multiple runs with different initializations), however, would lead to a better approximation of the posterior. This is left for future work.

7.4 Notes on Generalization Error. If the prior distribution of targets p(f) (see Section A.1) is uniform (or if the distribution of prior distributions is uniform), no algorithm can obtain a lower expected generalization error than training-error-reducing algorithms (see, e.g., Wolpert 1994b). Typical target distributions in the real world are not uniform, however; the real world appears to favor problem solutions with low algorithmic complexity (see, e.g., Schmidhuber 1994a). MacKay (1992a) suggests searching for alternative priors if the generalization error indicates a "poor regularizer." He also points out that with a "good" approximation of the nonuniform prior, more probable posterior hypotheses do not necessarily have a lower generalization error. For instance, there may be noise on the test set, or two hypotheses representing the same function may have different posterior values, and the expected generalization error ought to be computed over the whole posterior, not for a single solution. Schmidhuber (1994b) proposes a general "self-improving" system whose entire life is viewed as a single training sequence and which continually attempts to modify its priors incrementally based on experience with previous problems (see also Schmidhuber 1996).

7.5 Ongoing Work on Low-Complexity Coding. FMS can also be useful for unsupervised learning. In recent work, we postulate that a generally useful code of given input data fulfills three MDL-inspired criteria: (1) it conveys information about the input data, (2) it can be computed from the data by a low-complexity mapping, and (3) the data can be computed from the code by a low-complexity mapping. To obtain such codes, we simply train an auto-associator with FMS (after training, codes are represented across the hidden units). In initial experiments, depending on data and architecture, this always led to well-known kinds of codes considered useful in previous work by numerous researchers: we sometimes obtained factorial codes, sometimes local codes, and sometimes sparse codes. In most cases, the codes were of the low-redundancy, binary kind. Initial experiments with a speech data benchmark problem (vowel recognition) already showed the true usefulness of codes obtained by FMS: feeding the codes into standard,
supervised, overfitting backpropagation classifiers, we obtained much better generalization performance than with competing approaches.

Appendix: Theoretical Justification

An alternative version of this Appendix (but with some minor errors) can be found in Hochreiter and Schmidhuber (1994). An expanded version of this Appendix (with a detailed description of the algorithm) is available on the World Wide Web (see our home pages).

A.1 Flat Nets: The Most Probable Hypotheses—Outline. We introduce a novel kind of generalization error that can be split into an overfitting error and an underfitting error. To find hypotheses causing low generalization error, we first select a subset of hypotheses causing low underfitting error. We are interested in those of its elements causing low overfitting error.

After listing relevant definitions, we will introduce a somewhat unconventional variant of the Gibbs algorithm, designed to take into account that FMS uses only the training data D0 to determine G(· | D0), a distribution over the set of hypotheses expressing our prior belief in hypotheses (here we do not care where the data came from; this will be treated later). This variant of the Gibbs algorithm will help us to introduce the concept of expected extended generalization error, which can be split into an overfitting error (relevant for measuring whether the learning algorithm focuses too much on the training set) and an underfitting error (relevant for measuring whether the algorithm sufficiently approximates the training set). To obtain these errors, we measure the Kullback-Leibler distance between the posterior p(· | D0) after training on the training set and the posterior pD0(· | D) after (hypothetical) training on all data (here the subscript D0 indicates that for learning D, G(· | D0) is used as prior belief in hypotheses, too). The overfitting error measures the information conveyed by p(· | D0) but not by pD0(· | D). The underfitting error measures the information conveyed by pD0(· | D) but not by p(· | D0).

We then introduce the tolerable error level and the set of acceptable minima. The latter contains hypotheses with low underfitting error, assuming that D0 indeed conveys information about the test set (every training-set-error-reducing algorithm makes this assumption). In the remainder of the Appendix, we will focus only on hypotheses within the set of acceptable minima. We introduce the relative overfitting error, which is the relative contribution of a hypothesis to the mean overfitting error on the set of acceptable minima. The relative overfitting error measures the overfitting error of hypotheses with low underfitting error. The goal is to find a hypothesis with low overfitting error and, consequently, low generalization error. The relative overfitting error is approximated based on the trade-off between low training set error and large values of G(· | D0). The distribution
G(· | D0) is restricted to the set of acceptable minima, to obtain the distribution G_M(D0)(· | D0). We then assume the data are obtained from a target chosen according to a given prior distribution. Using previously introduced distributions, we derive the expected test set error and the expected relative overfitting error. We want to reduce the latter by choosing a certain G_M(D0)(· | D0) and G(· | D0). The special case of noise-free data is considered. To be able to minimize the expected relative overfitting error, we need to adopt a certain prior belief p(f). The only unknown distributions required to determine G_M(D0)(· | D0) are p(D0 | f) and p(D | f). They describe how (noisy) data are obtained from the target. We have to make the following assumptions: the choice of prior belief is "appropriate," the noise on data drawn from the target has mean 0, and small noise is more probable than large noise (the noise assumptions ensure that reducing the training error, by choosing some h from M(D0), reduces the expected underfitting error). We do not need gaussian assumptions, though. We show that FMS approximates our special variant of the Gibbs algorithm: the prior is approximated locally in weight space, and flat net(w) are approximated by flat net(w0) with w0 near w in weight space.

A.1.1 Definitions. Let A = {(x, y) | x ∈ X, y ∈ Y} be the set of all possible input-output pairs (pairs of vectors). Let NET be the set of functions that can be implemented by the network. For every net function g ∈ NET we have g ⊂ A. Elements of NET are parameterized with a parameter vector w from the set of possible parameters W. net(w) is a function that maps a parameter vector w onto a net function g (net is surjective). Let T be the set of target functions f, where T ⊂ NET. Let H be the set of hypothesis functions h, where H ⊂ T. For simplicity, take all sets to be finite, and let all functions map each x ∈ X to some y ∈ Y. Values of functions with argument x are denoted by g(x), net(w)(x), f(x), h(x). We have (x, g(x)) ∈ g; (x, net(w)(x)) ∈ net(w); (x, f(x)) ∈ f; (x, h(x)) ∈ h.

Let D = {(x_p, y_p) | 1 ≤ p ≤ m} be the data, where D ⊂ A. D is divided into a training set D0 = {(x_p, y_p) | 1 ≤ p ≤ n} and a test set D \ D0 = {(x_p, y_p) | n < p ≤ m}. For the moment, we are not interested in how D was obtained. We use squared error E(D, h) := Σ_{p=1}^{m} ||y_p − h(x_p)||², where || · || is the Euclidean norm; E(D0, h) := Σ_{p=1}^{n} ||y_p − h(x_p)||², and E(D \ D0, h) := Σ_{p=n+1}^{m} ||y_p − h(x_p)||². E(D, h) = E(D0, h) + E(D \ D0, h) holds.

A.1.2 Learning. We use a variant of the Gibbs formalism (see Opper and Haussler 1991, or Levin et al. 1990). Consider a stochastic learning algorithm (random weight initialization, random learning rate). The learning algorithm attempts to reduce training set error by randomly selecting a hypothesis with low E(D0, h), according to some conditional distribution
G(· | D0) over H. G(· | D0) is chosen in advance, but in contrast to traditional Gibbs (which deals with unconditional distributions on H), we may take a look at the training set before selecting G. For instance, one training set may suggest linear functions as being more probable than others, another one splines, and so forth. The unconventional Gibbs variant is appropriate because FMS uses only X0 (the set of first components of D0's elements; see Section 3) to compute the flatness of net(w0). The trade-off between the desire for low E(D0, h) and the a priori belief in a hypothesis according to G(· | D0) is governed by a positive constant β (interpretable as the inverse temperature from statistical mechanics, or the amount of stochasticity in the training algorithm). We obtain p(h | D0), the learning algorithm applied to data D0:

    p(h | D0) = G(h | D0) exp(−βE(D0, h)) / Z(D0, β),   (A.1)

where

    Z(D0, β) = Σ_{h ∈ H} G(h | D0) exp(−βE(D0, h)).   (A.2)
Z(D0, β) is the error momentum generating function, or the weighted accessible volume in configuration space, or the partition function (from statistical mechanics). For theoretical purposes, assume we know D and may use it for learning. To learn, we use the same distribution G(h | D0) as above (prior belief in some hypotheses h is based exclusively on the training set). There is a reason that we do not use G(h | D) instead: G(h | D) does not allow for making a distinction between a better prior belief in hypotheses and a better approximation of the test set data. However, we are interested in how G(h | D0) performs on the test set data D \ D0. We obtain

    pD0(h | D) = G(h | D0) exp(−βE(D, h)) / ZD0(D, β),   (A.3)

where

    ZD0(D, β) = Σ_{h ∈ H} G(h | D0) exp(−βE(D, h)).   (A.4)

The subscript D0 indicates that the prior belief is chosen based on D0 only.
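The two posteriors are easy to compute exactly for a small finite hypothesis set. The following toy sketch, with made-up priors and errors for five hypotheses, is ours; it only illustrates equations A.1 through A.4 and the role of the inverse temperature β.

```python
import numpy as np

def gibbs_posterior(prior, errors, beta):
    """p(h | data) ~ G(h | D0) * exp(-beta * E(data, h)) (eqs. A.1-A.4)."""
    unnorm = prior * np.exp(-beta * errors)
    return unnorm / unnorm.sum()   # normalization = partition function Z

G = np.array([0.4, 0.3, 0.15, 0.1, 0.05])        # prior belief G(. | D0)
E_D0 = np.array([0.1, 0.3, 0.0, 0.8, 0.05])      # training set errors E(D0, h)
E_rest = np.array([0.2, 0.1, 0.9, 0.4, 0.15])    # test set errors E(D \ D0, h)

for beta in (0.1, 5.0, 50.0):
    p_D0 = gibbs_posterior(G, E_D0, beta)            # eq. A.1
    p_D = gibbs_posterior(G, E_D0 + E_rest, beta)    # eq. A.3
    print(beta, p_D0.round(3), p_D.round(3))
# Small beta: posteriors stay close to the prior.
# Large beta: probability mass moves to the low-error hypotheses.
```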
A.1.3 Expected Extended Generalization Error. We define the expected extended generalization error EG(D, D0) on the unseen test exemplars D \ D0:

    EG(D, D0) := Σ_{h ∈ H} p(h | D0) E(D \ D0, h) − Σ_{h ∈ H} pD0(h | D) E(D \ D0, h).   (A.5)
Here EG(D, D0) is the mean error on D \ D0 after learning with D0, minus the mean error on D \ D0 after learning with D. The second (negative) term is a lower bound (due to nonzero temperature) for the error on D \ D0 after learning the training set D0. For the zero-temperature limit β → ∞ we get (summation convention explained below)

    EG(D, D0) = Σ_{h ∈ H, D0 ⊂ h} ( G(h | D0) / Z(D0) ) E(D \ D0, h),   where   Z(D0) = Σ_{h ∈ H, D0 ⊂ h} G(h | D0).
In this case, the generalization error depends on G(h | D0), restricted to those hypotheses h compatible with D0 (D0 ⊂ h). For β → 0 (full stochasticity), we get EG(D, D0) = 0.

Summation convention: In general, Σ_{h ∈ H, D0 ⊂ h} denotes summation over those h satisfying h ∈ H and D0 ⊂ h. In what follows, we keep an analogous convention: the first symbol is the running index, for which additional expressions specify conditions.

A.1.4 Overfitting and Underfitting Error. Let us separate the generalization error into an overfitting error Eo and an underfitting error Eu (in analogy to Wang et al. 1994 and Guyon et al. 1992). We will see that overfitting and underfitting error correspond to the two different error terms in our algorithm: decreasing one term is equivalent to decreasing Eo, and decreasing the other is equivalent to decreasing Eu.

Using the Kullback-Leibler distance (Kullback 1959), we measure the information conveyed by p(· | D0) but not by pD0(· | D) (see Fig. 4). We may view this as information about G(· | D0): since there are more h compatible with D0 than there are h compatible with D, G(· | D0)'s influence on p(h | D0) is stronger than its influence on pD0(h | D). To get the nonstochastic bias (see the definition of EG), we divide this information by β and obtain the overfitting error:

    Eo(D, D0) := (1/β) Σ_{h ∈ H} p(h | D0) ln( p(h | D0) / pD0(h | D) )
               = Σ_{h ∈ H} p(h | D0) E(D \ D0, h) + (1/β) ln( ZD0(D, β) / Z(D0, β) ).   (A.6)
Figure 4: Positive contributions to the overfitting error Eo(D, D0), after learning the training set with a large β.

Analogously, we measure the information conveyed by pD0(· | D) but not by p(· | D0) (see Fig. 5). This information is about D \ D0. To get the nonstochastic bias (see the definition of EG), we divide this information by β and obtain the underfitting error:

    Eu(D, D0) := (1/β) Σ_{h ∈ H} pD0(h | D) ln( pD0(h | D) / p(h | D0) )
               = −Σ_{h ∈ H} pD0(h | D) E(D \ D0, h) + (1/β) ln( Z(D0, β) / ZD0(D, β) ).   (A.7)
Peaks in G(· | D0) that do not match peaks of pD0(· | D) produced by D \ D0 lead to overfitting error. Peaks of pD0(· | D) produced by D \ D0 that do not match peaks of G(· | D0) lead to underfitting error. Overfitting and underfitting error thus tell us something about the shape of G(· | D0) with respect to D \ D0, that is, about the degree to which the prior belief in h is compatible with D \ D0.
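Equations A.5 through A.7 can be checked numerically; anticipating the identity EG = Eo + Eu shown in Section A.1.6, the self-contained toy sketch below (random priors and errors, all names ours) verifies that the KL-based definitions decompose the expected extended generalization error exactly.

```python
import numpy as np

def posterior(prior, err, beta):
    p = prior * np.exp(-beta * err)
    return p / p.sum()

rng = np.random.default_rng(5)
n_h = 8
G = rng.dirichlet(np.ones(n_h))          # prior belief G(. | D0)
E_train = rng.uniform(0, 1, n_h)         # E(D0, h)
E_test = rng.uniform(0, 1, n_h)          # E(D \ D0, h)
beta = 3.0

p_D0 = posterior(G, E_train, beta)               # trained on D0
p_D = posterior(G, E_train + E_test, beta)       # trained on all of D

E_G = p_D0 @ E_test - p_D @ E_test                       # eq. A.5
E_o = (p_D0 * np.log(p_D0 / p_D)).sum() / beta           # eq. A.6
E_u = (p_D * np.log(p_D / p_D0)).sum() / beta            # eq. A.7
print(E_G, E_o + E_u)   # the two agree: E_G = E_o + E_u
```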
26
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Figure 5: Positive contributions to the underfitting error Eu(D, D0), after learning the training set with a small β. Again, we use the D-posterior from Figure 4, assuming it is almost fully determined by E(D, h) (even if β is smaller than in Figure 4).
A.1.5 Why Are They Called "Overfitting" and "Underfitting" Error? Positive contributions to the overfitting error are obtained where peaks of p(· | D0) do not match (or are higher than) peaks of pD0(· | D): there, some h will have large probability after training on D0 but lower probability after training on all data D. This is either because D0 has been approximated too closely or because of sharp peaks in G(· | D0); the learning algorithm specializes either on D0 or on G(· | D0) ("overfitting"). The specialization on D0 will become even worse if D0 is corrupted by noise (the case of noisy D0 will be treated later). Positive contributions to the underfitting error are obtained where peaks of pD0(· | D) do not match (or are higher than) peaks of p(· | D0): there, some h will have large probability after training on all data D but lower probability after training on D0. This is due either to a poor D0 approximation (note that p(· | D0) is almost fully determined by G(· | D0)) or to insufficient information about D conveyed by D0 ("underfitting"). Either the algorithm did not learn "enough" of D0, or D0 does not tell us anything about D. In the latter case, there is nothing we can do; we have to focus on the case where we did not learn enough about D0.

A.1.6 Analysis of Overfitting and Underfitting Error. EG(D, D0) = Eo(D, D0) + Eu(D, D0) holds. For the zero-temperature limit β → ∞, we obtain ZD0(D) = Σ_{h ∈ H, D ⊂ h} G(h | D0) and Z(D0) = Σ_{h ∈ H, D0 ⊂ h} G(h | D0), so that Eo(D, D0) = Σ_{h ∈ H, D0 ⊂ h} ( G(h | D0) / Z(D0) ) E(D \ D0, h) = EG(D, D0) and Eu(D, D0) = 0; that is, there is no underfitting error. For β → 0 (full stochasticity) we get Eu(D, D0) = 0 and Eo(D, D0) = 0 (recall that EG is not the conventional but the extended expected generalization error). Since D0 ⊂ D, ZD0(D, β) < Z(D0, β) holds. In what follows, averages after learning on D0 are denoted by ⟨·⟩_D0, and averages after learning on D are denoted by ⟨·⟩_D. Since

    ZD0(D, β) = Σ_{h ∈ H} G(h | D0) exp(−βE(D0, h)) exp(−βE(D \ D0, h)),

we have

    ZD0(D, β) / Z(D0, β) = Σ_{h ∈ H} p(h | D0) exp(−βE(D \ D0, h)) = ⟨exp(−βE(D \ D0, ·))⟩_D0.

Analogously, we have Z(D0, β) / ZD0(D, β) = ⟨exp(βE(D \ D0, ·))⟩_D. Thus,

    Eo(D, D0) = ⟨E(D \ D0, ·)⟩_D0 + (1/β) ln⟨exp(−βE(D \ D0, ·))⟩_D0,   and
    Eu(D, D0) = −⟨E(D \ D0, ·)⟩_D + (1/β) ln⟨exp(βE(D \ D0, ·))⟩_D.²

With large β, after learning on D0, Eo measures the difference between the average test set error and a minimal test set error. With large β, after learning on D, Eu measures the difference between the average test set error and a maximal test set error. Assume we do have a large β (large enough to exceed the minimum of (1/β) ln( ZD0(D, β) / Z(D0, β) )). We have to assume that D0 indeed conveys information about the test set: preferring hypotheses h with small E(D0, h) by using a larger β leads to smaller test set error (without this assumption, no error-decreasing algorithm would make sense). Eu can be decreased by enforcing less stochasticity (by further increasing β), but this will increase Eo. Similarly, decreasing β (enforcing more stochasticity) will decrease Eo but increase Eu. Increasing β decreases the maximal test set error after learning D more than it decreases the average test set error, thus decreasing Eu, and vice versa. Decreasing β increases the minimal test set error after learning D0 more than it increases the average test set error, thus decreasing Eo, and vice versa. This is the trade-off between stochasticity and fitting the training set, governed by β.

A.1.7 Tolerable Error Level and Set of Acceptable Minima. Let us implicitly define a tolerable error level Etol(α, β) that, with confidence 1 − α, is the upper bound of the training set error after learning:

    p(E(D0, h) ≤ Etol(α, β)) = Σ_{h ∈ H, E(D0,h) ≤ Etol(α,β)} p(h | D0) = 1 − α.   (A.8)

With (1 − α)-confidence, we have E(D0, h) ≤ Etol(α, β) after learning. Etol(α, β) decreases with increasing β, α. Now we define M(D0) := {h ∈ H | E(D0, h) ≤ Etol(α, β)}, which is the set of acceptable minima (see Section 2). The set of acceptable minima is a set of hypotheses with low underfitting error.
We have −
∂ 2 ln Z(D0 ,β) ∂β 2
∂ ln Z(D0 ,β) ∂β
∂ ln ZD0 (D,β) ∂β ∂ 2 ln ZD0 (D,β)
= hD0 E(D0 , ·)i and − , ·)i)2 i
= hD E(D, ·)i. Furthermore,
= hD0 (E(D0 , ·) − hD0 E(D0 and = hD (E(D, ·) − hD E(D, ·)i)2 i. ∂β 2 See also Levin et al. (1990). Using these expressions, it can be shown: by increasing β (starting from β = 0), we will find a β that minimizes further makes this expression go to 0.
1 β
ln
ZD0 (D,β) Z(D0 ,β)
< 0. Increasing β
28
Sepp Hochreiter and Jurgen ¨ Schmidhuber
With probability 1 − α, the learning algorithm selects a hypothesis from M(D0 ) ⊂ H. Note that for the zero temperature limit β → ∞, we have Etol (α) = 0 and M(D0 ) = {h ∈ H | D0 ⊂ h}. By fixing a small Etol (or a large β), Eu will be forced to be low. We would like to have an algorithm decreasing (1) training set error (this corresponds to decreasing underfitting error) and (2) an additional error term, which should be designed to ensure low overfitting error, given a fixed small Etol . The remainder of Section A.1 will lead to an answer as to how to design this additional error term. Since low underfitting is obtained by selecting a hypothesis from M(D0 ), in what follows we will focus on M(D0 ) only. Using an appropriate choice of prior belief, at the end of this section, we will finally see that the overfitting error can be reduced by an error term expressing preference for flat nets. A.1.8. Relative Overfitting Error. Let us formally define the relative overfitting error Ero , which is the relative contribution of some h ∈ M(D0 ) to the mean overfitting error of hypotheses set M(D0 ): E (D, D0 , M(D0 ), h) = pM(D0 ) (h | D0 )E(D \ D0 , h), where ro p(h | D0 ) h∈M(D0 ) p(h | D0 )
pM(D0 ) (h | D0 ) := P
(A.9) (A.10)
for h ∈ M(D0 ), and zero otherwise. For h ∈ M(D0 ), we approximate p(h | D0 ) as follows. We assume that G(h | D0 ) is large where E(D0 , h) is large (trade-off between low E(D0 , h) and G(h | D0 )). Then p(h | D0 ) has large values (due to large G(h | D0 )) where E(D0 , h) ≈ Etol (α, β) (assuming Etol (α, β) is small). We get p(h | D0 ) ≈
G(h | D0 ) exp(−βEtol (α, β)) Z(D0 , β)
The relative overfitting error can now be approximated by Ero (D, D0 , M(D0 ), h) ≈ P
G(h | D0 ) E(D \ D0 , h) . h∈M(D0 ) G(h | D0 )
(A.11)
To obtain a distribution over M(D0 ), we introduce GM(D0 ) (· | D0 ), the normalized distribution G(· | D0 ) restricted to M(D0 ). For approximation A.11 we have Ero (D, D0 , M(D0 ), h) ≈ GM(D0 ) (h | D0 )E(D \ D0 , h) .
(A.12)
A.1.9. Prior Belief in f and D. Assume D was obtained from a target function f . Let p( f ) be the prior on targets and p(D | f ) the probability of
Flat Minima
29
obtaining D with a given f . We have p( f | D0 ) =
p(D0 | f )p( f ) , p(D0 )
(A.13)
P where p(D0 ) = f ∈T p(D0 | f )p( f ) . The data are drawn from a target function with added noise (the noisefree case is treated below). We do not make any assumptions about the nature of the noise; it does not have to be gaussian (as in MacKay’s work, 1992b). We want to select a G(· | D0 ) that makes Ero small; that is, those h ∈ M(D0 ) with small E(D \ D0 , h) should have high probabilities G(h | D0 ). We do not know D \ D0 during learning. D is assumed to be drawn from a target f . We compute the expectation of Ero , given D0 . The probability of the test set D \ D0 , given D0 , is p(D \ D0 | D0 ) =
X
p(D \ D0 | f )p( f | D0 ),
(A.14)
f ∈T
where we assume p(D \ D0 | f, D0 ) = p(D \ D0 | f ) (we do not remember which exemplars were already drawn). The expected test set error E(·, h) for some h, given D0 , is X
p(D \ D0 | D0 )E(D \ D0 , h)
D\D0
=
X f ∈T
p( f | D0 )
X
(A.15)
p(D \ D0 | f )E(D \ D0 , h) .
D\D0
The expected relative overfitting error Ero (·, D0 , M(D0 ), h) is obtained by inserting equation A.15 into equation A.12: Ero (·, D0 , M(D0 ), h) X X ≈ GM(D0 ) (h | D0 ) p( f | D0 ) p(D \ D0 | f )E(D \ D0 , h). f ∈T
(A.16)
D\D0
A.1.10. Minimizing Expected Relative Overfitting Error. We define a GM(D0 ) (· | D0 ) such that GM(D0 ) (· | D0 ) has its largest value near small expected test set error E(·, ·) (see equations A.12 and A.15). This definition leads to a low expectation of Ero (·, D0 , M(D0 ), ·) (see equation A.16). Define GM(D0 ) (h | D0 ) := δ(argminh0 ∈M(D0 ) (E(·, h0 )) − h),
(A.17)
where δ is the Dirac delta function, which we will use with loose formalism; the context will make clear how the delta function is used.
30
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Using equation A.15 we get GM(D0 ) (h | D0 )
X
= δ argminh0 ∈M(D0 )
X
p( f | D0 )
f ∈T
(A.18) p(D \ D0 | f )E(D \ D0 , h0 ) − h .
D\D0
GM(D0 ) (· | D0 ) determines the hypothesis h from M(D0 ) that leads to the lowest expected test set error. Consequently, we achieve the lowest expected relative overfitting error. GM(D0 ) helps us define G: G(h | D0 ) := P
ζ + GM(D0 ) (h | D0 ) , h∈H (ζ + GM(D0 ) (h | D0 ))
(A.19)
/ M(D0 ), and where ζ is a small constant where GM(D0 ) (h | D0 ) = 0 for h ∈ ensuring positive probability G(h | D0 ) for all hypotheses h. To appreciate the importance of the prior p( f ) in the definition of GM(D0 ) (see also equation A.24), in what follows, we will focus on the noise-free case. Let p(D0 | f ) be equal to
A.1.11. The Special Case of Noise-Free Data. δ(D0 ⊂ f ) (up to a normalizing constant): δ(D0 ⊂ f )p( f ) . p( f | D0 ) = P f ∈T,D0 ⊂ f p( f )
(A.20)
δ(D\D0 ⊂ f ) Assume p(D \ D0 | f ) = P
. Let F be the number of elements P δ(D\D0 ⊂ f ) in X. p(D \ D0 | f ) = . We expand D\D0 p(D \ D0 | f )E(D \ D0 , h) 2F−n from equation A.15: D\D0
1 2F−n
X
δ(D\D0 ⊂ f )
E(D \ D0 , h) =
D\D0 ⊂ f
= =
1 2F−n 1 2F−n
X
X
E((x, y), h)
(A.21)
D\D0 ⊂ f (x,y)∈D\D0
X
E((x, y), h)
(x,y)∈ f \D0
¶ F−n µ X F−n−1 i=1
i−1
1 E( f \ D0 , h) . 2
P Here E((x, y), h) = ky − h(x)k2 , E( f \ D0 , h) = (x,y)∈ f \D0 ky − h(x)k2 , and PF−n ¡F−n−1¢ = 2F−n−1 . The factor 1/2 results from considering the mean i=1 i−1
Flat Minima
31
test set error (where the test set is drawn from f ), whereas E( f \ D0 , h) is the maximal test set error (obtained by using a maximal test set). From equations A.15 and A.21, we obtain the expected test set error E(·, h) for some h, given D0 : X
p(D \ D0 | D0 )E(D \ D0 , h) =
D\D0
1X p( f | D0 )E( f \ D0 , h) . 2 f ∈T
(A.22)
From equations A.22 and A.12, we obtain the expected Ero (·,D0 ,M(D0 ),h): Ero (·, D0 , M(D0 ), h) ≈
1 2
GM(D0 ) (h | D0 )
P f ∈T
(A.23) p( f | D0 )E( f \ D0 , h) .
For GM(D0 ) (h | D0 ) we obtain in this noise-free case GM(D0 ) (h | D0 ) =
X
δ argminh0 ∈M(D0 )
(A.24)
p( f | D0 )E( f \ D0 , h0 ) − h .
f ∈T
The lowest expected test set error is measured by D0 , h). See equation A.22.
1 2
P
f ∈T
p( f | D0 )E( f \
A.1.12. Noisy Data and Noise-Free Data: Conclusion. For both the noisefree and the noisy case, equation A.14 shows that given D0 and h, the expected test set error depends on prior target probability p( f ). A.1.13. Choice of Prior Belief. Now we select some p( f ), our prior belief in target f . We introduce a formalism similar to Wolpert’s (1994a ). p( f ) is defined as the probability of obtaining f = net(w) by choosing a w randomly according to p(w). R Let us first look at Wolpert’s formalism: p( f ) = dwp(w)δ(net(w) − f ). By restricting W to Winj , he obtains an injective function netinj : Winj → NET : netinj (w) = net(w) , which is net restricted to Winj . netinj is surjective (because net is surjective): Z δ(netinj (w) − f ) |det net0inj (w)|dw p(w) (A.25) p( f ) = |det net0inj (w)| W Z δ(g − f ) = dg p(net−1 inj (g)) |det net0inj (net−1 NET inj (g))| =
p(net−1 inj ( f )) |det net0inj (net−1 inj ( f ))|
,
32
Sepp Hochreiter and Jurgen ¨ Schmidhuber
where |det net0inj (w)| is the absolute Jacobian determinant of netinj , evaluated at w. If there is a locally flat net(w) = f (flat around w), then p( f ) is high. However, we prefer to follow another path. Our algorithm (flat minimum search) tends to prune a weight wi if net(w) is very flat in wi ’s direction. It prefers regions where det net0 (w) = 0 (where many weights lead to the same net function). Unlike Wolpert’s approach, ours distinguishes the probabilities of targets f = net(w) with det net0 (w) = 0. The advantage is that we do not only search for net(w) that are flat in one direction but for net(w) that are flat in many directions (this corresponds to a higher probability of the corresponding targets). Define net−1 (g) := {w ∈ W | net(w) = g}
(A.26)
and X
p(net−1 (g)) :=
p(w) .
(A.27)
w∈net−1 (g)
We have P p( f ) = P
g∈NET f ∈T
P
p(net−1 (g))δ(g − f )
g∈NET
p(net−1 (g))δ(g
− f)
= P
p(net−1 ( f )) . (A.28) −1 f ∈T p(net ( f ))
net partitions W into equivalence classes. To obtain p( f ), we compute the probability of w being in the equivalence class {w | net(w) = f }, if randomly chosen according to p(w). An equivalence class corresponds to a net function; net maps all w of an equivalence class to the same net function. A.1.14. Relation to FMS Algorithm. FMS (from Section 3) works locally in weight space W. Let w0 be the actual weight vector found by FMS (with h = net(w0 )). Recall the definition of GM(D0 ) (h | D0 ) (see equations A.17 and A.18); we want to find a hypothesis h that best approximates those f with large p( f ) (the test data have a high probability of being drawn from such targets). We will see that those f = net(w) with flat net(w) locally have high probability p( f ). Furthermore we will see that a w0 close to w with flat net(w) has flat net(w0 ) too. To approximate such targets f , the only thing we can do is find a w0 close to many w with net(w) = f and large p( f ). To justify this approximation (see definition of p( f | D0 ) while recalling that h ∈ GM(D0 ) ), we assume that the noise has mean 0 and that small noise is more likely than large noise (e.g., gaussian, Laplace, Cauchy distributions). To restrict p( f ) = p(net(w)) to a local range in W, we define regions of equal net functions F(w) = {w¯ | ∀τ 0 ≤ τ ≤ 1, w + τ (w¯ − w) ∈ W : net(w) = net(w + τ (w¯ − w))}. Note that F(w) ⊂ net−1 (net(w)). If net(w) is flat along long distances in many directions w¯ − w, then F(w) has many elements. Locally in weight space, at w0 with h = net(w0 ), for γ > 0 we define: if the
Flat Minima
33
¯ = f } exists, then minimum w = argminw¯ {kw¯ − w0 k | kw¯ − w0 k < γ , net(w) pw0 ,γ ( f ) = c p(F(w)), where c is a constant. If this minimum does not exist, then pw0 ,γ ( f ) = 0. pw0 ,γ ( f ) locally approximates p( f ). During search for w0 (corresponding to a hypothesis h = net(w0 )), to decrease locally the expected test set error (see equation A.15), we want to enter areas where many large F(w) are near w0 in weight space. We wish to decrease the test set error, which is caused by drawing data from highly probable targets f (those with large pw0 ,γ ( f )). We do not know, however, which w’s are mapped to target’s f by net(·). Therefore, we focus on F(w) (w near w0 in weight space), instead of pw0 ,γ ( f ). Assume kw0 − wk is small enough to allow for a Taylor expansion, and that net(w0 ) is flat in direction (w¯ − w0 ): net(w) = net(w0 + (w − w0 )) = net(w0 ) + ∇net(w0 )(w − w0 ) 1 + (w − w0 )H(net(w0 ))(w − w0 ) + · · · , 2 where H(net(w0 )) is the Hessian of net(·) evaluated at w0 , ∇net(w)(w¯ − w0 ) = ∇net(w0 )(w¯ − w0 ) + O(w − w0 ), and (w¯ − w0 )H(net(w))(w¯ − w0 ) = (w¯ − w0 )H(net(w0 ))(w¯ − w0 ) + O(w − w0 ) (analogously for higher-order derivatives). We see that in a small environment of w0 , there is flatness in direction (w¯ − w0 ) too. And, if net(w0 ) is not flat in any direction, this property also holds within a small environment of w0 . Only near w0 with flat net(w0 ), there may exist w with large F(w). Therefore, it is reasonable to search for a w0 with h = net(w0 ), where net(w0 ) is flat within a large region. This means searching for the h determined by GM(D0 ) (· | D0 ) of equation A.17. Since h ∈ M(D0 ), E(D0 , net(w0 )) ≤ Etol holds, we search for a w0 living within a large connected region, where for all w within this region P E(net(w0 ), net(w), X) = x∈X knet(w0 )(x) − net(w)(x)k2 ≤ ², where ² is defined in Section 2. To conclude, we decrease the relative overfitting error and the underfitting error by searching for a flat minimum (see the definition of flat minima in Section 2). A.1.15. Practical Realization of the Gibbs Variant. 1. Select α and Etol (α, β), thus implicitly choosing β. 2. Compute the set M(D0 ). 3. Assume we know how data are obtained from target f ; that is, we know p(D0 | f ), p(D\D0 | f ), and the prior p( f ). Then we can compute GM(D0 ) (· | D0 ) and G(· | D0 ). 4. Start with β = 0 and increase β until equation A.8 holds. Now we know the β from the implicit choice above.
34
Sepp Hochreiter and Jurgen ¨ Schmidhuber
5. Since we know all we need to compute p(h | D0 ), select some h according to this distribution. A.1.16. Three Comments on Certain FMS Limitations. 1. FMS only approximates the Gibbs variant given by the definition of GM(D0 ) (h | D0 ) (see equations A.17 and A.18). We only locally approximate p( f ) in weight space. If f = net(w) is locally flat around w, then there exist units or weights that can be given with low precision (or can be removed). If there are other weights wi with net(wi ) = f , then one may assume that there are also points in weight space near such wi where weights can be given with low precision (think of, for example, symmetrical exchange of weights and units). We assume the local approximation of p( f ) is good. The most probable targets represented by flat net(w) are approximated by a hypothesis h, which is also represented by a flat net(w0 ) (where w0 is near w in weight space). To allow for approximation of net(w) by net(w0 ), we have to assume that the hypothesis set H is dense in the target set T. If net(w0 ) is flat in many directions, then there are many net(w) = f that share this flatness and are well approximated by net(w0 ). The only reasonable thing FMS can do is to make net(w0 ) as flat as possible in a large region around w0 , to approximate the net(w) with large prior probability (recall that flat regions are approximated by axis-aligned boxes, as discussed in Section 7.2). This approximation is fine if net(w0 ) is smooth enough in “unflat” directions (small changes in w0 should not result in very different net functions). 2. Concerning point 3 above. p( f | D0 ) depends on p(D0 | f ) (how the training data are drawn from the target; see equation A.13). GM(D0 ) (h | D0 ) depends on p( f | D0 ) and p(D\D0 | f ) (how the test data are drawn from the target). Since we do not know how the data are obtained, the quality of the approximation of the Gibbs algorithm may suffer from noise that has not mean 0, or from large noise being more probable than small noise. Of course, if the choice of prior belief does not match the true target distribution, the quality of GM(D0 ) (h | D0 )’s approximation will suffer as well. 3. Concerning point 5 above. FMS outputs only a single h instead of p(h | D0 ). This issue is discussed in Section 7.3. A.1.17. Conclusion. Our FMS algorithm from Section 3 only approximates the Gibbs algorithm variant. Two important assumptions are made. The first is that an appropriate choice of prior belief has been made. The second is that the noise on the data is not too “weird” (mean 0, small noise more likely). The two assumptions are necessary for any algorithm based
Flat Minima
35
on an additional error term besides the training error. The approximations are that p( f ) is approximated locally in weight space, and flat net(w) are approximated by flat net(w0 ) with w0 near w’s. Our Gibbs variant takes into account that FMS uses only X0 for computing flatness. A.2. Why Does the Hessian Decrease? This section shows that secondorder derivatives of the output function vanish during flat minimum search. This justifies the linear approximations in Section 4. A.2.1. Intuition. We show that the algorithm tends to suppress the following values: unit activations, first-order activation derivatives, and the sum of all contributions of an arbitrary unit activation to the net output. Since weights, inputs, activation functions, and their first- and second-order derivatives are bounded, the entries in the Hessian decrease where the corresponding |δwij | increase. A.2.2. Formal Details. We consider a strictly layered feedforward network with K output units and g layers. We use the same activation function f for all units. For simplicity, in what follows we focus on a single input vector xp . xp (and occasionally w itself) will be notationally suppressed. We have ( yj ∂yl 0 = f (sl ) P ∂ym ∂wij m wlm wij
)
for i = l for i 6= l
,
(A.29)
P where ya denotes the activation of the ath unit, and sl = m wlm ym . The last term of equation 3.1 (the “regulator”) expresses output sensitivity (to be minimized) with respect to simultaneous perturbations of all weights. “Regulation” is done by equalizing the sensitivity of the output units with respect to the weights. The “regulator” does not influence the same particular units or weights for each training example. It may be ignored for the purposes of this section. Of course, the same holds for the first (constant) term in equation 3.1. We are left with the second term. With equation A.29, we obtain: X
log
i,j
X k
=2
Ã
∂ok ∂wij
!2
X
(fan-in of unit k) log | f 0 (sk )|
unit k in the gth layer
+2
X unit j in the (g−1)th layer
(fan-out of unit j) log |y j |
36
Sepp Hochreiter and Jurgen ¨ Schmidhuber
X
+
(fan-in of unit j) log
unit j in the (g−1)th layer
f 0 (sk )wkj
¢2
k
X
+2
X¡
(fan-in of unit j) log | f 0 (sj )|
unit j in the (g−1)th layer
X
+2
(fan-out of unit j) log |y j |
unit j in the (g−2)th layer
X
+
(fan-in of unit j) log
unit j in the (g−2)th layer
à f 0 (sk )
×
X
X k
!2 f 0 (sl )wkl wlj
l
X
+2
(fan-in of unit j) log | f 0 (sj )|
unit j in the (g−2)th layer
X
+2
(fan-out of unit j) log |y j |
unit j in the (g−3)th layer
X
+
log
i,j, where unit i in a layer <(g−2)
à ×
0
f (sk )
X l1
0
f (sl1 )wkl1
X l2
X k
∂yl2 wl1 l2 ∂wij
!2 . (A.30)
Let us have a closer look at this equation. We observe: 1. Activations of units decrease in proportion to their fan-outs. 2. First-order derivatives of the activation functions decrease in proportion to their fan-ins. P P P P 3. A term of the form k ( f 0 (sk ) l1 f 0 (sl1 )wkl1 l2 · · · lr f 0 (slr )wlr−1 lr wlr j )2 expresses the sum of unit j’s squared contributions to the net output. Here r ranges over {0, 1, . . . , g−2}, Pand unit j is in the (g−1−r)th layer (for the special case r = 0, we get k ( f 0 (sk )wkj )2 ). These terms also decrease in proportion to unit j’s fan-in. Analogously, equation A.30 can be extended to the case of additional layers. A.2.3. Comment. Let us assume that f 0 (sj ) = 0 and f (sj ) = 0 is “difficult to achieve” (can be achieved only by fine-tuning all weights on connections to unit j). Instead of minimizing | f (sj )| or | f 0 (sj )| by adjusting the net input of unit j (this requires fine-tuning of many weights), our algorithm prefers pushing weights wkl on connections to output units toward zero (other weights are less affected). On the other hand, if f 0 (sj ) = 0 and f (sj ) = 0
Flat Minima
37
is not difficult to achieve, then, unlike weight decay, our algorithm does not necessarily prefer weights close to zero. Instead, it prefers (possibly very strong) weights that push f (sj ) or f 0 (sj ) toward zero (e.g., with sigmoid units active in [0,1]: strong inhibitory weights are preferred; with gaussian units: high absolute weight values are preferred). See the experiment in Section 5.2. A.2.4. How Does This Influence the Hessian? The entries in the Hessian corresponding to output ok can be written as follows: ∂ 2 ok f 00 (sk ) ∂ok ∂ok = 0 ∂wij ∂wuv ( f (sk ))2 ∂wij ∂wuv à ! X ∂ 2 yl ∂y j ∂yv 0 ¯ ¯ + f (sk ) wkl + δik + δuk , ∂wij ∂wuv ∂wuv ∂wij l
(A.31)
where δ¯ is the Kronecker delta. Searching for big boxes, we run into regions of acceptable minima with ok ’s close to target (Section 2). Thus, by scaling f 00 (s ) the targets, ( f 0 (s k))2 can be bounded. Therefore, the first term in equation A.31 k decreases during learning. According to the analysis above, the first-order derivatives in the second term of equation A.31 are pushed toward zero. So are the wkl of the sum in the second term of equation A.31. The only remaining expressions of interest are second-order derivatives ∂ 2 yl
of units in layer (g − 1). The ∂wij ∂wuv are bounded if the weights, the activation functions, their first- and second-order derivatives, and the inputs are bounded. This is indeed the case, as will be shown for networks with one or two hidden layers: In case 1, for unit l in a single hidden layer (g = 3), we obtain ¯ ¯ ¯ ∂ 2 yl ¯ ¯ ¯ (A.32) ¯ = |δ¯li δ¯lu f 00 (sl )y j yv | < C1 , ¯ ¯ ∂wij ∂wuv ¯ where y j , yv are the components of an input vector xp , and C1 is a positive constant. In case 2, for unit l in the third layer of a net with two hidden layers (g = 4), we obtain ¯ ¯ ¯ ∂ 2 yl ¯ ¯ ¯ ¯ = | f 00 (sl )(wli y j + δ¯il y j )(wlu yv + δ¯ul yv ) ¯ ¯ ∂wij ∂wuv ¯ + f 0 (sl )(wli δ¯iu f 00 (si )y j yv + δ¯il δ¯uj f 0 (sj )yv + δ¯ul δ¯iv f 0 (sv )y j )| < C2 ,
(A.33)
38
Sepp Hochreiter and Jurgen ¨ Schmidhuber
where C2 is a positive constant. Analogously, the boundedness of secondorder derivatives can be shown for additional hidden layers. k decrease A.2.5. Conclusion. As desired, our algorithm makes the Hij,uv where |δwij | or |δwuv | increase.
A.3. Explicit Derivative of Equation 3.1. We explicitly compute the derivatives of equation 3.1. For simplicity, in what follows we focus on a single input vector xp . Again, xp (and occasionally w itself) will be notationally suppressed. The derivative of the right-hand side of equation 3.1 is: ∂B(w,xp ) ∂wuv
=
P
∂ok ∂ 2 ok k ∂wij ∂wij ∂wuv 2 ∂om m ∂wij
P
P ³
i,j
´
¯ ¯ P ³ ∂om ´2 ∂ok P ∂om ¯ k ¯ ∂o ³ ´ ∂ 2 ok − ∂wij P m ∂wij P q ¯ ∂wij ¯ P sign ∂ok ∂wij ∂wuv m ∂wij P ∂om 2 i,j ∂wij µ ¶ 32 k i,j ³ ´ 2 ) ( P ∂w m m
ij
+L
2
m
∂o ∂wij
∂ 2 om ∂wij ∂wuv
.
¯ ¯ ¯ ∂ok ¯ ¯ ∂wij ¯ P P r k i,j P ³ ´2 m
∂om ∂wij
(A.34)
To compute equation 3.2, we need ∂ok ∂w
∂B(w,xp ) ∂
³
ij ´ = P ³ ∂om ´2 ∂ok
∂wij
m
∂wij
¯ m¯ ¯ 2 ∂om ∂ok ∂om ³ ´ δ¯mk P ( ∂w ∂o ¯ ) − P ∂wij ∂wij ¯ P q ¯ ∂w m ij ∂om lr sign ∂wij µ ¶ 32 m l,r P ³ ´ ¯ 2 m 2 ∂o P ) ( ¯ m ¯ ∂wlr m
+L
P m
P l,r
2
¯ m
∂o ∂wij
,
(A.35)
m | ∂o | ∂wlr
qP ¡ ¢ ¯ 2 ∂om ¯ m
∂wlr
where δ¯ is the Kronecker delta. Using the nabla operator and equation A.35, we can compress equation A.34: ¶ µ X Hk ∇ ∂ok B(w, xp ) , (A.36) ∇uv B(w, xp ) = k
∂wij
Flat Minima
39
where Hk is the Hessian of the output ok . Since the sums over l, r in equation A.35 need to be computed only once (the results are reusable for all i, j), ∇ ∂ok B(w, xp ) can be computed in O(L) time. The product of the Hessian and ∂wij
a vector can be computed in O(L) time. With constant number of output units, the computational complexity of our algorithm is O(L). The long version of this paper (available on the World-Wide Web; see our home pages) contains pseudo-code of an efficient implementation. It is based on fast multiplication of the Hessian and a vector due to Pearlmutter (1994) and Møller (1993). Acknowledgments We thank David Wolpert and several anonymous referees for numerous comments that helped to improve previous drafts of this article. This work was supported by DFG grant SCHM 942/3-1 from Deutsche Forschungsgemeinschaft. References Akaike, H. 1970. Statistical predictor identification. Ann. Inst. Statist. Math. 22, 203–217. Amari, S., and Murata, N. 1993. Statistical theory of learning curves under entropic loss criterion. Neural Computation 5(1), 140–153. Ash, T. 1989. Dynamic node creation in backpropagation neural networks. Connection Science 1(4), 365–375. Bishop, C. M. 1993. Curvature-driven smoothing: A learning algorithm for feed-forward networks. IEEE Transactions on Neural Networks 4(5), 882–884. Buntine, W. L., and Weigend, A. S. 1991. Bayesian back-propagation. Complex Systems 5, 603–643. Carter, M. J., Rudolph, F. J., and Nucci, A. xJ. 1990. Operational fault tolerance of CMAC networks. In Advances in Neural Information Processing Systems 2, pp. 340–347. Morgan Kaufmann, San Mateo, CA. Craven, P., and Wahba, G. 1979. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numer. Math. 31, 377–403. Eubank, R. L. 1988. Spline smoothing and nonparametric regression. In SelfOrganizing Methods in Modeling, S. Farlow, ed. Marcel Dekker, New York. Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning algorithm. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 525–532. Morgan Kaufmann, San Mateo, CA. Golub, G., Heath, H., and Wahba, G. 1979. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, 215–224. Guyon, I., Vapnik, V., Boser, B., Bottou, L., and Solla, S. A. 1992. Structural risk minimization for character recognition. In Advances in Neural Information
40
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 471–479. Morgan Kaufmann, San Mateo, CA. Hanson, S. J., and Pratt, L. Y. 1989. Comparing biases for minimal network construction with backpropagation. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 177–185. Morgan Kaufmann, San Mateo, CA. Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 164–171. Morgan Kaufmann, San Mateo, CA. Hastie, T. J., and Tibshirani, R. J. 1990. Generalized additive models. Monographs on Statisics and Applied Probability 43. Hinton, G. E., and van Camp, D. 1993. Keeping neural networks simple. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pp. 11–18. Springer-Verlag, Berlin. Hochreiter, S. and Schmidhuber, J. 1994. Flat minimum search finds simple nets. Tech. Rep. FKI-200-94, Fakult¨at fur ¨ Informatik, Technische Universit¨at Mun¨ chen. Hochreiter, S., and Schmidhuber, J. 1995. Simplifying nets by discovering flat minima. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 529–536. MIT Press, Cambridge MA. Holden, S. B. 1994. On the theory of generalization and self-structuring in linearly weighted connectionist networks. Ph.D. thesis, Cambridge University. Kerlirzin, P., and Vallet, F. 1993. Robustness in multilayer perceptrons. Neural Computation 5(1), 473–482. Krogh, A., and Hertz, J. A. 1992. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 950–957. Morgan Kaufmann, San Mateo, CA. Kullback, S. 1959. Statistics and Information Theory. John Wiley, New York. LeCun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598–605. Morgan Kaufmann, San Mateo, CA. Levin, E., Tishby, N., and Solla, S. 1990. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE 78(10), 1568–1574. Levin, A. U., Leen, T. K., and Moody, J. E. 1994. Fast pruning using principal components. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 35–42. Morgan Kaufmann, San Mateo, CA. MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Computation 4, 415–447. MacKay, D. J. C. 1992b. A practical Bayesian framework for backprop networks. Neural Computation 4, 448–472. Matsuoka, K. 1992. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics 22(3), 436–440. Minai, A. A., and Williams, R. D. 1994. Perturbation response in feedforward networks. Neural Networks 7(5), 783–796.
Flat Minima
41
Møller, M. F. 1993. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(N) time. Tech. Rep. PB-432, Computer Science Department, Aarhus University, Denmark. Moody, J. E. 1989. Fast learning in multi-resolution hierarchies. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 29–39. Morgan Kaufmann, San Mateo, CA. Moody, J. E. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 847–854. Morgan Kaufmann, San Mateo, CA. Moody, J. E., and Utans, J. 1992. Principled architecture selection strategies for neural networks: Application to corporate bond rating prediction. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 847–854. Morgan Kaufmann, San Mateo, CA. Mosteller, F., and Tukey, J. W. 1968. Data analysis, including statistics. In Handbook of Social Psychology, G. Lindzey and E. Aronson, eds., Vol. 2. AddisonWesley, Reading, MA. Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 107–115. Morgan Kaufmann, San Mateo, CA. Murray, A. F., and Edwards, P. J. 1993. Synaptic weight noise during MLP learning enhances fault-tolerance, generalisation and learning trajectory. In Advances in Neural Information Processing Systems 5, S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., pp. 491–498. Morgan Kaufmann, San Mateo, CA. Neti, C., Schneider, M. H., and Young, E. D. 1992. Maximally fault tolerant neural networks. In IEEE Transactions on Neural Networks 3, pp. 14–23. Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight sharing. Neural Computation 4, 173–193. Opper, M., and Haussler, D. 1991. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann, San Mateo, CA. Pearlmutter, B. A. 1994. Fast exact multiplication by the Hessian. Neural Computation 6(1), 147–160. Pearlmutter, B. A., and Rosenfeld, R. 1991. Chaitin-Kolmogorov complexity and generalization in neural networks. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 925– 931. Morgan Kaufmann, San Mateo, CA. Refenes, A. N., Francis, G., and Zapranis, A. D. 1994. Stock performance modeling using neural networks: A comparative study with regression models. Neural Networks 7, 375–388. ¨ Rehkugler, H., and Poddig, T. 1990. Statistische Methoden versus Kunstliche Neuronale Netzwerke zur Aktienkursprognose. Tech. Rep. 73, University Bamberg, Fakult¨at Sozial- und Wirtschaftswissenschaften. Rissanen, J. 1978. Modeling by shortest data description. Automatica 14, 465–471.
42
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Schmidhuber, J. H. 1994a. Discovering problem solutions with low Kolmogorov complexity and high generalization capability. Tech. Rep. FKI-194-94, Fakult¨at fur ¨ Informatik, Technische Universit¨at Munchen. ¨ Short version in Machine Learning: Proceedings of the Twelfth International Conference, A. Prieditis and S. Russell, eds., pp. 488-496, Morgan Kaufmann, San Mateo, 1995. Schmidhuber, J. H. 1994b. On learning how to learn learning strategies. Tech. Rep. FKI-198-94, Fakult¨at fur ¨ Informatik, Technische Universit¨at Munchen. ¨ Schmidhuber, J. 1996. A general method for multi-agent learning and incremental self-improvement in unrestricted environments. In Evolutionary Computation: Theory and Applications, X. Yao, ed., Scientific Publ. Co., Singapore. Shannon, C. E. 1948. A mathematical theory of communication (parts I and II). Bell System Technical Journal 27, 379–423. Stone, M. 1974. Cross-validatory choice and assessment of statistical predictions. Roy. Stat. Soc. 36, 111–147. Vapnik, V. 1992. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems 4, J. E. Moody, S. J. Hanson, and R. P. Lippman, eds., pp. 831–838. Morgan Kaufmann, San Mateo, CA. Wallace, C. S., and Boulton, D. M. 1968. An information theoretic measure for classification. Computer Journal 11(2), 185–194. Wang, C., Venkatesh, S. S., and Judd, J. S. 1994. Optimal stopping and effective machine complexity in learning. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 303–310. Morgan Kaufmann, San Mateo, CA. Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875–882. Morgan Kaufmann, San Mateo, CA. White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Computation 1(4), 425–464. Williams, P. M. 1994. Bayesian regularisation and pruning using a Laplace prior. Tech. Rep., School of Cognitive and Computing Sciences, University of Sussex, Falmer, Brighton. Wolpert, D. H. 1994a. Bayesian backpropagation over i-o functions rather than weights. In Advances in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, eds., pp. 200–207. Morgan Kaufmann, San Mateo, CA. Wolpert, D. H. 1994b. The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. Tech. Rep. SFI-TR-03-123, Santa Fe Institute, NM.
Received January 5, 1995; accepted March 19, 1996.
NOTE
Communicated by Raphael Ritz
Lyapunov Functions for Neural Nets with Nondifferentiable Input-Output Characteristics Jianfeng Feng Mathematisches Institut, University of Munich, D-80333 Munich, Germany and Laboratory of Biomathematics, The Babraham Institute, Cambridge CB2 4AT, UK
I construct Lyapunov functions for asynchronous dynamics and synchronous dynamics of neural networks with nondifferentiable input-output characteristics.
1 Introduction A central concern in designing a parallel, distributed neural network is the global stability of the collective network state when many parts of the system are free to change their state simultaneously (Albeverio et al. 1995; Golden 1993; Goles and Martinez 1990; Grossberg 1988; Hopfield 1982, 1984; Herz and Marcus 1993; Marcus and Westervelt 1989). As an example, attractor networks configured to store fixed patterns may be globally stable when the states of single “neurons” are updated one at a time—forgoing parallelism altogether—but may possess a large number of spurious oscillatory modes when all neurons are updated in parallel. In particular, all of the strong global stability results for sequential or distributed updating are constrained by the requirement that the input-output characteristic f of a network is an increasing differentiable sigmoid function. However, there are many models of neural networks using a nondifferentiable characteristic f . Such models include Linsker’s model and the Brain-State-in-a-Box (BSB) model (Feng et al. 1995; Feng et al. 1996; Feng and Tirozzi 1995, 1996; Hui and Zak 1992; Linsker 1986). This note reports that a Lyapunov function for neural networks with asynchronous or synchronous dynamics but nondifferentiable input-output characteristics takes the same form as in the differentiable characteristic case. The main idea in my rigorous proof is that instead of employing the Taylor expansion, which requires the differentiability of the input-output characteristic, I use the Legendre-Fenchel transformation (Goles and Martinez 1990). The conclusions for Lyapunov functions can be extended to the same model with distributed dynamics (Herz and Marcus 1993; Marcus and Westervelt 1989). Neural Computation 9, 43–49 (1997)
c 1997 Massachusetts Institute of Technology °
44
Jianfeng Feng
2 Asynchronous Dynamics An asynchronous dynamics is a composition of two dynamics. First we P select a neuron i from (1, . . . , N) with probability pi > 0, i pi = 1 and the state Xi (t) of the ith neuron at time t is updated to the new state according to N X aij rj Xj (t) + bi Xi (t + 1) = f j=1
but keep all other states unchanged: Xj (t + 1) = Xj (t), j 6= i. Here A = (aij , i, j = 1, . . . , N) is a symmetric N × N matrix, ri ≥ 0, i = 1, . . . , N and f (x) a continuous function that is strictly increasing if α ≤ x ≤ β and f (x) = α if x ≤ α, f (x) = β if x ≥ β. Note that f is not differentiable, and this is the difficulty for proving the global stability of the asynchronous dynamics. Without loss of generality we suppose that α < 0 = f (0) < β. A Lyapunov function for the stochastic process X(t) = (Xi (t), i = 1, . . . , N) is called a supermartingale (Chow and Teicher 1988), defined as follows: Definition 1.
A stochastic process L(X(t)) is called a supermartingale if
E(L(X(t + 1)) | Ft ) ≤ L(X(t))
(2.1)
where L is a measurable function, Ft = σ (X(1), . . . , X(t)) the sigma algebra generated by X(s), s = 1, . . . , t and E(· | Ft ) is the conditional expectation with respect to the sigma algebra Ft . The meaning of the definition of a supermartingale is clear. In average, that is, E(L(X(t + 1)) | Ft ) the function L decreases with respect to its dynamic evolution. Certainly when the dynamics is deterministic, equation 2.1 reduces to the condition of a usual Lyapunov function L(X(t + 1)) ≤ L(X(t)). Define L(X(t)) =
N Z X j=1
Xj (t) 0
rj f −1 (y) dy −
N N X 1X aji Xj (t)Xi (t)rj ri − rj bj Xj (t). 2 j,i=1 j=1
Notice that when f is differentiable, L coincides with the Lyapunov function used by Marcus and Westervelt (1989) to study the time evolution of iterated map networks, with the function in Herz and Marcus (1993) for distributed dynamics, and with Hopfield’s energy function for binary neurons and continuous neurons (Feng and Tirozzi 1995; Grossberg 1988; Hopfield 1982, 1984). Here another question arises: Does the asynchronous dynamics reach the set of attractors within finite time? In other words, is the system
Lyapunov Functions for Neural Nets
45
dissipative? For this purpose we introduce τ (²) := inf{t : ρ(X(t), A) ≤ ²}, which is the first time for the process X(t) to enter the ²-neighborhood of the attractors A of the asynchronous dynamics X(t), where ρ is the Euclidean distance on RN . Theorem 1. L(X(t)) is a supermartingale and ∀² > 0, τ (²) < ∞ provided that aii ≥ 0, i = 1, . . . , N. Proof. In terms of the definition of the asynchronous dynamics, which only changes state of neuron i with probability pi at a time step, we have E(L(X(t + 1)) | Ft ) =
1 1X aij ri rj Xi (t)Xj (t) − akk rk2 Xk2 (t + 1) 2 2 k=1 i,j6=k X Z Xi (t) X ri bi Xi (t) + ri f −1 (y) dy − N X
pk −
i6=k
i6=k
0
− bk rk Xk (t + 1) N X − aik ri rk Xk (t + 1)Xi (t) i=1 i6=k
Z
#
Xk (t+1)
+
rk f
0
−1
(y) dy .
(2.2)
Therefore E(L(X(t + 1)) | Ft ) − L(X(t)) N N X 1 X pk − ajk rj rk (Xk (t + 1) − Xk (t))Xj (t) − rk2 akk (Xk2 (t + 1) − Xk2 (t)) = 2 j=1 k=1 j6=k
Z
Xk (t+1)
+ 0
rk f −1 (y) dy −
Z 0
Xk (t)
rk f −1 (y) dy − bk rk Xk (t + 1) + bk rk Xk (t)
N N X X 1 = pk − ajk rj rk (Xk (t + 1) − Xk (t))Xj (t) − rk2 akk (Xk (t + 1) − Xk (t))2 2 j=1 k=1 Z Xk (t+1) Z Xk (t) −1 −1 + rk f (y) dy − rk f (y) dy − bk rk Xk (t + 1) + bk rk Xk (t) . 0
0
46
Jianfeng Feng
Figure 1: An explanation of the Legendre-Fenchel transformation. It is easily RX RY RX R f (X) −1 f (y) dy. seen that XY = 0 f (x) dx + 0 f −1 (y) dy = 0 f (x) dx + 0
In terms of the Legendre-Fenchel transformation, which in our case is fairly straightforward (see Fig. 1), Z
Xk (t+1)
f
−1
Z (y) dy +
0
uk (t)
0
f (x) dx = uk (t)Xk (t + 1),
t ≥ 0,
where uk (t) :=
N X
akj rj rk Xj (t) + bk
j=1
we have that E(L(X(t + 1)) | Ft ) − L(X(t)) " Z Z N uk (t) X p k rk − f (x) dx + ≤ k=1
0
0
uk (t−1)
# f (x) dx + Xk (t)(uk (t) − uk (t − 1))
≤ 0 since the function f is a strictly increasing function. So the function L(X(t))
Lyapunov Functions for Neural Nets
47
is a Lyapunov function (supermartingale) of the dynamics X(t). Next we consider the second conclusion of the theorem. As long as X(t) 6∈ S² := {x; ρ(x, A) ≤ ², x ∈ RN }, g(X(t)) := E(L(X(t + 1) | Ft )) − L(X(t)) < 0. Hence there is a δ = δ(²) > 0 such that g(X(t)) < −δ for X(t) ∈ [α, β]N \ S² . Therefore the process M(t) = L(X(t)) + tδ is also a bounded supermartingale. According to Doob’s (Chow and Teicher 1988) theorem, M(τ ∧ t) is again a bounded supermartingale. From the convergence theorem for supermartingales (Chow and Teicher 1988), it follows that lim M(τ ∧ t) = M < ∞
t→∞
a.s.
Thus we conclude that lim M(τ ∧ t) = lim (L(X(τ ∧ t)) + δ · (τ ∧ t)) < ∞
t→∞
t→∞
a.s.
Note that L(X(τ ∧ t)) is itself a bounded supermartingale; therefore, limt→∞ (τ ∧ t) < ∞ a.s. which implies τ < ∞ a.s. 3 Synchronous Dynamics Consider the following synchronous dynamics: Y(t + 1) = F(AR(Y(t))0 + B), t = 0, . . . ,
(3.1)
where Y(t) = (Yi (t), i = 1, . . . , N) ∈ RN for (·)0 representing the transpose of a vector, R = (ri δij ) for δij = 1 if i = j and δij = 0 if i 6= j, ri ≥ 0, i, j = 1, . . . , N, B = (bi , i = 1, . . . , N) ∈ RN and F = ( f, . . . , f ). As one may expect the behavior of the synchronous dynamics defined by equation 3.1 is substantially different from that of the asynchronous dynamics. In the case of asynchronous dynamics, we know from Theorem 1 that the set of attractors of the dynamics are all fixed-point attractors. Here, however, in addition to stable fixed points, there are attractors of two-state limit cycles for the synchronous dynamics, which is the same situation as in the Hopfield model. Due to the compactness of the state space of the dynamics (equation 3.1) we assert that the synchronous dynamics (equation 3.1) is dissipative. The proof of the following theorem is omitted since it is similar to that of Theorem 1.
48
Jianfeng Feng
Theorem 2.
The function
V(Y(t)) = −
X
aij ri rj Yi (t)Yj (t + 1) −
i,j
+
XZ i
0
Yi (t)
ri f −1 (u) du +
X
ri bi (Yi (t) + Yi (t + 1))
XZ i
Yi (t+1)
ri f −1 (u) du
0
is a Lyapunov function of the synchronous dynamics. The set of all fixed-point attractors of the asynchronous or synchronous dynamics can be divided into two categories: saturated attractors and unsaturated attractors. It is reported that saturated attractors are most likely to be observed in numerical simulations in Linsker’s model and the BSB model and are investigated in Feng et al. (1995, 1996). 4 Conclusions I construct Lyapunov functions for a class of asynchronous and synchronous dynamics, which include many models in neural networks as a special case. For the two most commonly utilized dynamics of neural networks with discrete time, asynchronous and synchronous dynamics, my results indicate that in general it is impossible to define a Lyapunov function of the model under consideration depending on only one state. For the asynchronous dynamics, a restriction on the matrix or on the interaction between neurons is needed for the existence of a Lyapunov function, while there may exist two-state limit cycles for the synchronous dynamics. This note constructs Lyapunov functions of the asynchronous and synchronous dynamics with nondifferentiable characteristics and may help our understanding of the dynamic properties of related models. Acknowledgments This article was partially supported by the A. von Humboldt Foundation of Germany. I thank an anonymous referee for bringing the references Herz and Marcus (1993) and Marcus and Westervelt (1989) to my attention. References Albeverio, S., Feng, J., and Qian, M. 1995. Role of noises in neural networks. Phys. Rev. E. 52, 6593–6606. Chow, S. Y., and Teicher, H. 1988. Probability Theory. Springer-Verlag, New York. Feng, J., Pan, H., and Roychowdhury, V. P. 1995. A rigorous analysis of Linskertype Hebbian learning. In Advances in Neural Information Processing Systems
Lyapunov Functions for Neural Nets
49
7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 319–326. MIT Press, Cambridge, MA. Feng, J., Pan, H., and Roychowdhury, V. P. 1996. On neurodynamics with limiter function and Linsker’s developmental model. Neural Computation 8(5), 1003– 1019. Feng, J., and Tirozzi, B. 1995. The SLLN for the free-energy of the Hopfield and spin glass model. Helvetica Physica Acta 68, 365–379. Feng, J., and Tirozzi, B. 1996. An application of the saturated attractor analysis to three typical models. In Cybernetic and Systems ’96, R. Trappl, ed., pp. 1102– 1107. World Scientific, Singapore. Golden, R. M. 1993. Stability and optimization analyses of the generalized BrainState-in-a-Box neural network model. Journal of Mathematical Psychology 37, 282–298. Goles, E., and Martinez, S. 1990. Neural and Automata Networks. Kluwer, Dordrecht. Grossberg, S. 1988. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks 1, 17–61. Herz, A. V. M., and Marcus, C. M. 1993. Distributed dynamics in neural networks. Phys. Rev. E 47, 2155–2161. Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA 79, 2554–2558. Hopfield, J. J. 1984. Neurons with graded response have computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092. Hui, S., and Zak, S. H. 1992. Dynamical analysis of the Brain-State-in-a-Box (BSB) neural models. IEEE Transactions on Neural Networks 3, 86–94. Linsker, R. 1986. From basic network principle to neural architecture (series). Proc. Natl. Acad. Sci. USA 83, 7508–7512, 8390–8394, 8779–8783. Marcus, C. M., and Westervelt, R. M. 1989. Dynamics of iterated map networks. Phys. Rev. A 40, 501–504.
Received October 23, 1995; accepted April 9, 1996.
Communicated by George Gerstein
Detecting Synchronous Cell Assemblies with Limited Data and Overlapping Assemblies Gary Strangman Department of Cognitive and Linguistic Sciences, Brown University, Providence, RI 02912 USA
Two statistical methods—cross-correlation (Moore et al. 1966) and gravity clustering (Gerstein et al. 1985)—were evaluated for their ability to detect synchronous cell assemblies from simulated spike train data. The two methods were first analyzed for their temporal sensitivity to synchronous cell assemblies. The presented approach places a lower bound on the amount of data required to detect a synchronous assembly. On average, both methods required the same minimum amount of recording time to detect significant pairwise correlations, but the gravity method exhibited less variance in the recording time. The precise length of recording depends on the consistency with which a neuron fires synchronously with the assembly but was independent of the assembly firing rate. Next, the statistical methods were tested with respect to their ability to differentiate two distinct assemblies that overlapped in time and space. Both statistics could adequately differentiate two overlapping synchronous assemblies. For cross-correlation, this ability deteriorates quickly when considering three or more simultaneously active, overlapping assemblies, whereas the gravity method should be more flexible in this regard. The work demonstrates the difficulty of detecting assembly phenomena from simultaneous neuronal recordings. Other statistical methods and the detection of other types of assemblies are also discussed.
1 Introduction Although electrophysiological recordings have been a part of neuroscience research for over a century (Caton 1875), only in the past 20 years has there been considerable technological growth in multiple simultaneous electrode recordings (Pickard and Welberry 1976; Gross et al. 1977; McNaughton et al. 1983; Wilson and McNaughton 1993; Nicolelis et al. 1993; Nordhausen et al. 1994). Recently it has even become feasible to record individual spike trains from over 100 neurons simultaneously (e.g., Wilson and McNaughton 1993; Falk et al. 1993). Such technological advances do more than simply facilitate data collection. With the data from many simultaneously recorded neurons, one can begin to ask so-called functional questions about the brain. Neural Computation 9, 51–76 (1997)
c 1997 Massachusetts Institute of Technology °
52
Gary Strangman
One critical functional question stems from Barlow (1972, 1992), who postulates that each cell is an independent representational unit. In contrast with this view, several other theorists (including James 1890; Hebb 1949; Palm 1982; Edelman 1987) have predicted that neurons depend on one another for representation; that is, their responses are organized into functional assemblies. Evidence is beginning to surface in favor of the latter hypothesis—that is, assembly-based neuronal organization. When Georgopoulos et al. (1986) recorded from monkeys performing reaching tasks, they found data consistent with arm movement direction being encoded across populations of direction-selective neurons. Using information theory, Gochin et al. (1994) provide evidence that neurons in inferotemporal cortex also use some type of distributed (i.e., assembly-based) coding scheme, in this case for visually presented shapes. Data in Nicolelis et al. (1993) suggest that distributed representations exist for whiskers in the rat somatosensory thalamus as well. And it has even been suggested that neural codes in monkey frontal cortex can be found only by investigating synchrony across ensembles of neurons (Vaadia et al. 1995). Though these results support the existence of cell assemblies, in almost all cases it remains to be demonstrated precisely how groups of neurons work together to form these functional units. That is, what particular type of distributed code or cell assembly is implemented in the brain? Although this question is unresolved, many hypotheses have been introduced. Perhaps the most straightforward distributed code is synchrony, where neurons of an assembly fire synchronized volleys of spikes. In this case, assembly membership is revealed as temporally correlated neural discharges. It is worth dwelling briefly on the synchronous assembly hypothesis because such assemblies will form the basis for the analysis presented in this chapter. The simplest possible synchronous cell assembly is one where all assembly-cell spike trains are identical; that is, every neuron in the assembly fires at precisely the same time as every other neuron in the assembly. For a slightly more general synchronous cell assembly, relax the temporal precision constraint somewhat and thus introduce temporal jitter into otherwise identical spike trains. An even more general definition also relaxes the identity constraint. Thus, individual neurons are allowed to discharge at times when the assembly as a whole is silent (e.g., random background spikes in the spike trains) and/or neurons can “skip” some proportion of assembly discharge times (a measure of consistency). As an alternative to synchronous assemblies, it is possible that phaselocked oscillations in neuronal firing could bind active neurons together into assemblies. Eckhorn et al. (1988) and Gray et al. (1989) have suggested this, among others. More complex assembly definitions are also possible, including lateral inhibition networks, local feedback loops (Kubie 1930), attractor networks (e.g., Anderson et al. 1977; Hopfield 1982), or other spatiotemporal activation patterns (e.g., synfire chains; Abeles 1982, 1991).
Detecting Synchronous Cell Assemblies
53
Assuming that one has hypothesized a particular assembly type and provided that adequate numbers of neurons have been sampled (Strangman 1996), the following question remains: What are the appropriate analytical tools to test whether the hypothesized assembly is present in the data? Although many tools exist, few have been evaluated for their abilities to find assembly phenomena. The goal of this paper is to evaluate two statistics toward this end. Before selecting such tools, however, an assembly type must be established because the assembly type determines the relevant parameters for statistical analysis. Synchronous assemblies were the chosen type of functional assembly for two main reasons: (1) synchrony is commonly suggested as an important parameter of brain function, and (2) synchronous assemblies are one of the few types of functional units that require no additional a priori knowledge about the neurons. Assembly hypotheses like local feedback loops, miniature classifier networks, cellular automata, or even simple lateral inhibition all require some knowledge about the role of each recorded neuron in the computation (inhibiting versus inhibited, etc.) in order to classify correctly neurons as belonging to one or another active assembly. Any analysis of this sort requires single-unit spike train data known a priori to contain an assembly. This precludes the use of electrophysiological data, and therefore the required data were generated by a computer model. The synchronous assemblies (defined earlier) were modeled with the assembly jitter controlled by a parameter ε and the firing consistency controlled by the parameter κ (see Section 2.1). These simulated spike trains then formed the input to the statistical methods. The goal of this work was to compare two different statistical techniques for their abilities to detect assembly phenomena. Cross-correlation (Moore et al. 1966) and the gravity method (Gerstein et al. 1985) were chosen because synchronous assemblies are defined by coincident spiking, and these two statistics are fundamentally coincidence detectors. The two methods were evaluated for their temporal sensitivity to assembly activity and for their ability to differentiate two simultaneously active assemblies present in the same recorded region. 2 Simulation Data and Analysis Procedures 2.1 Simulated Synchronous Assemblies. Eighty spike trains were generated each time cell assembly data were needed. The simulated neurons were given properties that roughly correspond to pyramidal cells of motor cortex, though the results are not specific to any particular brain region. As will be shown, the implemented firing rates are immaterial to the analysis. The generation method is similar to that of Aertsen and Gerstein (1985). Each binary unit fired at an average background rate randomly selected between 4 and 10 Hz. The precise firing times were modeled as independent Poisson processes (equal probability of a spike at every time step) modified
54
Gary Strangman
A
B
C
D
Figure 1: Illustration of the spike train generation process for assembly units. A source spike train is generated (A), and then spikes are selected with a probability κ. Spike train (B) shows one possible result of this process for κ = 0.4. These spikes are then added to the spike train generated by the background firing rate of a unit (C), resulting in the final spike train (D). Any jitter would be incorporated after creating spike train (B) by moving each spike to the left or right by no more than ε msec.
by a 3 msec absolute refractory period following each spike. This completely describes the nature of spike trains for units uninvolved in assembly activity. For simulated units participating in assembly activity, an additional procedure was implemented (see Fig. 1). First, a source spike train was generated. This spike train also had a Poisson distribution of spike times and a 3 msec absolute refractory period but fired at an average rate, %0 , between 150 Hz and 200 Hz (randomly selected for each assembly). During assembly activity, each source spike was added to a given unit’s spike train with a probability κ. After the addition, jitter was introduced by moving each added spike slightly ahead or behind its original time of occurrence, t. This was done by adding a random value in the range [−ε, ε] to the original source spike time t, and thus ε controlled the amount of jitter in the synchrony. Finally, any spikes occurring during the 3 msec absolute refractory period of another spike were eliminated as a cleanup process. This results in a synchronous assembly unit firing at an average rate of κ%0 Hz plus the unit’s background firing rate. A high source-unit firing rate was used so that κ times this rate was still considerably larger than the 4–10 Hz base firing rates even when κ = 0.2. At values of κ near 1, the units fire at unrealistically high rates (up to 150–200 Hz). However, as will be reported in Section 3, (and derived in Appendix A), the minimum required recording time is—somewhat unintuitively—independent of the assembly firing rate, making the realism of these rates immaterial.
When two assemblies were needed for analysis, two independent source spike trains were computed. Units participating in both assemblies (50% of the simulated neurons) received the proportion κ of spikes from both source trains. This was accomplished by applying the addition, jitter, and cleanup process twice—one time for each source spike train. All resulting spike trains contained additional internal structure only as occurred by chance. Again, 80 spike trains were generated per simulation, and the simulation time step was set at 0.2 msec. Data generation was performed on a VAX 6000-510 (Digital Equipment Corporation). Since the modeled spike trains depended heavily on the built-in random number generator, a battery of 10 statistical tests (Knuth 1981) was performed to ensure adequate randomness. All tests found the generator not significantly different from purely random (Student's t-test: p > 0.2 for all ten tests).

2.2 Statistical Methods. Many statistics have been developed for multiple spike train analysis. Many of these analyze pairs of neurons (e.g., Moore et al. 1966; Gerstein and Perkel 1969; Gerstein et al. 1989; Gochin et al. 1989; Aertsen et al. 1989; Aertsen and Gerstein 1991; Yamada et al. 1993). Some techniques compare trios of neurons (e.g., Perkel et al. 1975; Abeles et al. 1994) or compare all recorded neurons simultaneously (e.g., Gerstein et al. 1978, 1985; Aertsen et al. 1987; Bernard et al. 1993; Gochin et al. 1994; Nicolelis and Chapin 1994; Abeles et al. 1995). In this article, two statistical methods were chosen for comparative analysis. The first method selected was cross-correlation (Moore et al. 1966), which compares pairs of spike trains. If the two spike trains are considered as functions f and g, then the cross-correlation of these two functions can be computed as

Corλ(f, g) = ∫_{−∞}^{+∞} f(t + λ) g(t) dt,
where λ is called the lag of function f relative to function g and can be positive or negative. A cross-correlation histogram (or cross-correlogram, CCH) is made by using discrete function evaluation points, selecting a histogram bin width, and summing (instead of integrating) the product f(t + λ)g(t) into the bin covering the point λ. This process is repeated for all interesting values of λ. Cross-correlation was selected mainly because of its universal use in spike-train analysis. However, this method is also important because nearly all pairwise statistics are based on correlation and because rigorous study has provided it with adequate significance tests of the resulting cross-correlation histogram peaks (e.g., Perkel et al. 1967; Gerstein and Perkel 1972; Gochin et al. 1989). Computationally, each cross-correlogram requires O(N²) units of computing time (N being the number of spike trains analyzed), times the number of histograms that must be calculated, N(N − 1)/2, or O(N²). This makes the cross-correlation computations overall O(N⁴).
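As an illustration, the binned cross-correlogram just described can be computed as follows (a minimal Python sketch; the function name and parameters are illustrative, and no significance test is included):

    import numpy as np

    def cross_correlogram(f_times, g_times, bin_ms=2.0, max_lag_ms=50.0):
        """Binned cross-correlation histogram of f relative to g (times in msec)."""
        edges = np.arange(-max_lag_ms, max_lag_ms + bin_ms, bin_ms)
        counts = np.zeros(edges.size - 1)
        for t in g_times:
            lags = f_times - t              # lag lambda of each f spike
            counts += np.histogram(lags, bins=edges)[0]
        return edges, counts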
The second method selected was the gravity method (Gerstein et al. 1985), which analyzes all N simultaneously recorded neurons at once. This is accomplished in an N-dimensional space, where particles, one for each recorded neuron, originate at the corners of an N-dimensional hypercube (i.e., all N particles are initially equidistant). These particles then interact "gravitationally," where the attractive force between two particles is proportional to the "charge" on the particles. A particle's charge is instantaneously incremented whenever the neuron corresponding to that particle fires a spike, and the charge decays exponentially over time as controlled by a parameter, τ. Thus, for neurons that fire roughly synchronously, their corresponding particles will be charged simultaneously and therefore will attract one another. The result is a clustering of particles in the N-dimensional space, where clusters correspond to those groups of neurons that were simultaneously active. The method is adjusted to make the particle aggregation independent of average firing rate, to prevent net aggregation among independently firing neurons, and to eliminate attractive forces when particles reach a minimum distance. Such modifications are all described in Gerstein et al. (1985). The precise equations were as follows. For particle i at position xi(t) at time t, the position in the N-dimensional space at time t + h is calculated as

xi(t + h) = xi(t) + h σg [qi(t) − τ] fi(t),

where σg is the aggregation parameter, qi(t) is the charge on particle i at time t, τ is the average charge on all particles (equal to the charge-decay constant), and fi(t) is the force on particle i at time t. This force is given by

fi(t) = Σ_{j≠i} [qj(t) − τ] ri(j),
where ri(j) is the direction of the force on neuron i produced by neuron j and is given by ri(j) = [xj(t) − xi(t)]/sij(t), with sij(t) being the Euclidean distance between particles i and j at time t. Finally, the charge increment for neuron i, Δqi, had to be normalized for different mean firing rates among the various neurons. This is accomplished by making Δqi directly proportional to some measure of neuron i's interspike interval (ISI). Two measures were used: (1) Δqi was set equal to the cumulative mean ISI (as in Gerstein et al. 1985), or (2) Δqi was set equal to the most recent ISI. The gravity method was selected as the comparison statistic because of its unique representation (spatial instead of histogram based) and because it is designed to analyze an arbitrary number of spike trains simultaneously. Computationally, the gravity method requires O(N³) units of computing time and, like the mass of correlograms above, requires further analysis
of the clustered particles to determine the precise nature of any recorded assembly.

2.3 Temporal Sensitivity Analyses. To evaluate the temporal sensitivity of cross-correlation and the gravity method, a synchronous assembly with zero jitter (ε = 0 msec) was generated and then analyzed. Zero jitter results in perfectly synchronous assembly-unit spikes, which are easiest for the methods to detect. Thus, the analysis should place a lower bound on the temporal capabilities of each method at each level of consistency, κ. Three seconds of data were simulated for 80 spike trains, 60 of which formed a single synchronous assembly. This was done for each of five values of κ. The resulting data were first analyzed by cross-correlations using successively longer portions of the synchronous data as input. The lengths of data analyzed were 25, 50, 100, 200, 300, ..., 3000 msec. Twenty pairs of spike trains were selected at random for cross-correlation analysis, each pair analyzed using each length of data (up to a maximum of 32 correlograms per pair of neurons, or 6400 correlograms total). Histograms were made using a 2 msec bin width and were tested for a significant central peak using the smoothed-difference cross-correlogram (Gochin et al. 1989) at the 99% confidence level. This particular test avoids the large bin variance generated by subtracting the standard shift-predictor histogram (Perkel et al. 1967). When the synchronous peak remained statistically significant for three consecutive data lengths, the time required to obtain the first significant peak was taken as the "empirical" cross-correlation minimum recording time to reach significance. This represented the minimum recording time required to detect significant assembly membership, Tmin.

The temporal sensitivity analysis of the gravity method was performed on the same simulated data. The analyzed particle pairs were the same 20 neuron pairs used in the cross-correlograms, and the Euclidean distance between pairs of particles in the N-dimensional space was measured. In the absence of any rigorous statistical test for particle clusters, gravity Tmin was simply defined as the time it took for the particles to approach within five units of one another, given a starting position 100 units apart (with an absolute minimum particle separation of one unit). Implementing the gravity method requires setting two parameter values: the aggregation parameter, σg, and the charge decay constant, τ. The value of τ is not critical as long as it is kept small relative to the average interspike interval. However, false-positive or false-negative clustering can result from too large or too small a σg, respectively. Since there was prior knowledge available about which simulated neurons were members of the assembly, σg was optimized to afford maximal aggregation speed (i.e., to minimize recording time without giving false-positive clustering). The optimal value for these analyses was σg = 8.0 × 10⁻⁶. In the usual experimental situation, lacking a priori knowledge of which cells are assembly cells, σg can be a liability of the gravity method.
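For concreteness, the gravity computation described in Section 2.2 might be implemented as follows (a minimal Python sketch of the update equations above, not the original implementation; parameter names such as sigma_g and min_sep are illustrative):

    import numpy as np

    def decay_and_increment(q, spiked, dq, h, tau):
        """Charge dynamics: exponential decay (constant tau) plus an
        ISI-normalized increment dq_i for each neuron that spiked."""
        return q * np.exp(-h / tau) + dq * spiked

    def gravity_step(x, q, sigma_g, h, min_sep=1.0):
        """One position update; x is (N, N) particle positions, q is (N,) charges."""
        excess = q - q.mean()                  # q_i(t) minus the average charge
        new_x = x.copy()
        for i in range(len(q)):
            force = np.zeros(x.shape[1])
            for j in range(len(q)):
                if j == i:
                    continue
                diff = x[j] - x[i]
                dist = np.linalg.norm(diff)
                if dist > min_sep:             # attraction stops at minimum distance
                    force += excess[j] * diff / dist   # [q_j - tau] * r_i(j)
            new_x[i] = x[i] + h * sigma_g * excess[i] * force
        return new_x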
Finally, Δqi was set to the cumulative mean ISI, because this reduces the noise in the resulting interparticle distance plots. This normalization exploits the change in firing rate when assemblies turn on, but it also amplifies the clustering effects of synchrony, maximizing the speed of low-noise clustering (for further discussion of normalization, see Gerstein and Aertsen 1985). Since the method of evaluating significant assembly correlation differed dramatically between the two methods (smoothed-difference cross-correlogram significance versus a Euclidean distance of five units), there was no expectation that the absolute values of cross-correlation and gravity Tmin would be directly comparable. Planned comparisons therefore included dependence on firing consistency, κ, and the relative variability of the two methods.

2.4 Analysis of Two Assemblies. Cross-correlation and gravity clustering were also compared for their ability to detect and separate units from two active synchronous assemblies. Each statistic had to correctly classify the simulated neurons according to their assembly of origin. Two overlapping synchronous assemblies were simulated, again with 80 spike trains. Spike trains for units 1–9 and 71–80 fired at their base rates, 10–19 fired at base rates plus spikes for the first assembly, 61–70 fired at base rates plus spikes for the second assembly, and 20–60 fired at base rates plus spikes from both assemblies. Jitter for both assemblies was set at ±1 msec (i.e., ε = 1 msec). Cross-correlation and gravity methods were implemented on the entire overlapping assembly data set, and the methods were compared with respect to assembly detection and neuron classification errors. Parameter settings were as follows: unit consistency, κ = 0.4; cross-correlogram bin width, Δ = 2 msec (i.e., 2ε); gravity coalescence parameter, σg = 8.0 × 10⁻⁵; and decay parameter, τ = 2 msec. The value of Δqi was set equal to the most recent ISI in neuron i to make the method most sensitive to synchrony and less so to changes in unit firing rates.

3 Results

3.1 Theoretical Temporal Sensitivity. The temporal sensitivity of cross-correlation to synchronous assemblies can be formally derived in a fashion parallel to Aertsen and Gerstein's (1985) analysis of effective connectivity. (Refer to Appendix A for the complete formal analysis.) The result is an expression for the minimum amount of recording time required to detect significant pairwise spike-time correlation, Tmin:

Tmin = 4Δ/κ²   (or Tmin = 4σ²/(κ²Δ), if σ > Δ),   (3.1)
where Δ is the cross-correlogram bin width, κ is the consistency with which the neurons fire in synchrony with the assembly, and σ is the width of
Table 1: Comparison of Minimum Recording Time Required to Detect a Perfectly Synchronous Assembly Using Cross-Correlation and Gravity Methods.

                            Simulated data:                           Simulated data:
           Theoretical      Cross-Correlation Tmin (msec)(a)          Gravity Tmin (msec)(b)
    κ      CCH Tmin (msec)  Mean ± s.d.   Range      95% limit        Mean ± s.d.   Range     95% limit
    0.2    200              760 ± 400     100–1800   1200             760 ± 170     364–946   907
    0.4    50               200 ± 120     25–400     400              194 ± 52      73–330    247
    0.6    22               91 ± 100      25–400     300              97 ± 36       32–152    152
    0.8    13               51 ± 23       25–100     100              53 ± 27       23–99     99
    1.0    8                28 ± 8        25–50      50               25 ± 6        13–32     32

(a) Correlogram bin width, Δ, was 2 msec.
(b) Gravity parameters: σg = 8.0 × 10⁻⁶, τ = 2 msec.
the synchrony window (equal to 2ε). The derivation assumes only that a neuron's firing rate when the assembly is active is much larger than background firing rates. (If this assumption is relaxed, the theoretical value of Tmin increases. Thus, Tmin truly represents the minimum amount of recording time.) From equation 3.1, the theoretical minimum recording time required to detect synchronous assembly correlation depends on Δ and 1/κ² and—perhaps unintuitively—is independent of the assembly firing rate ρ0. This independence can be understood as follows: although increased firing rates provide more synchronous events per second, such rates also result in a higher background correlation level (and thus more noise). It turns out that these two effects cancel as long as background firing rates are much less than assembly firing rates. However, consistency does matter for minimum recording times; that is, the consistency with which units of an assembly fire synchronously with the assembly strongly influences Tmin. The first two columns of Table 1 show the theoretical Tmin values calculated by equation 3.1 for a range of consistency values (κ), with Δ (= σ) = 2 msec.
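The theoretical column of Table 1 follows directly from equation 3.1; a two-line check (Python):

    delta = 2.0                                   # correlogram bin width, msec
    for kappa in (0.2, 0.4, 0.6, 0.8, 1.0):
        print(f"kappa = {kappa}: Tmin = {4 * delta / kappa**2:.1f} msec")
    # -> 200.0, 50.0, 22.2, 12.5, 8.0 msec (cf. Table 1, column 2)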
3.2 Empirical Temporal Sensitivity. To test these theoretical values, simulated synchronous assemblies were analyzed by both cross-correlation and gravity methods, using five values of κ. Three typical cross-correlograms and one gravity interparticle distance plot are shown in Figure 2 for κ = 0.4, all analyzing the same data set where ε = 0 (perfectly synchronous assembly cell spiking). The cross-correlograms show a single pair of simulated units correlated on 300, 400, and 500 msec of synchronized data, respectively. The central peak first becomes significant when the assembly has been active for 400 msec. The gravity plot shows three pairs of nonassembly units (top three traces), three pairs of assembly units (including
Figure 2: Temporal sensitivity of cross-correlation and gravity methods to a single synchronous assembly, κ = 0.4. Three cross-correlograms (Δ = 2 msec) are shown for a single pair of neurons, using progressively longer portions of the synchronous data (top = 300 msec, middle = 400 msec, bottom = 500 msec; 70 Hz average firing rate). The middle correlogram was the first significant central peak using the significance test in Gochin et al. (1989) at the 99% level. That is, the analyzed pair of simulated spike trains required 400 msec of data before their (known) correlation with the assembly was determined to be significant. (On average, for κ = 0.4, the neurons required approximately 200 msec; this pair had weaker average correlation, due to random variation around κ.) The gravity analysis of the same data is shown for comparison, using three assembly–assembly pairs, three assembly–nonassembly pairs, and three nonassembly–nonassembly pairs of neurons (σg = 8.0 × 10⁻⁶, τ = 2 msec). Though no generalized significance tests have been developed for interparticle distance measurements of the gravity method, the assembly cells are distinctly separable by approximately 100 msec.
the pair used for cross-correlations; bottom three traces), and three assembly–nonassembly pairs (middle three traces).

The mean Tmin for 20 assembly pairs using cross-correlation was greater than the theoretical value by a constant factor of approximately 4 at all levels of κ (see Table 1, column 3). The factor of 4 is attributed to (1) the significance criterion used (99% plus three consecutive significant correlograms), which is a much more stringent signal detection condition than used in Appendix A, and (2) the fact that the background firing rates were not strictly zero. Importantly, the predicted linear trend of Tmin versus 1/κ² was highly significant (relative to no trend; F(1, 95) = 156.9, p ≪ 0.0001). Despite this, a broad range of experimental Tmin values was observed, particularly at smaller values of κ (Table 1, column 4). Since the assembly cells were perfectly synchronous (no jitter; ε = 0), only the individual unit 4–10 Hz background firing rates and random variation around κ remained as variables. To determine the minimum recording time for neurophysiological experiments, one is not concerned with the average length of recording required to detect assembly cells; rather, one is interested in the length of recording required to detect, say, 95% of the assembly cells. Such a limit for the assemblies used here is presented in Table 1, column 5. The 95% recording-time boundary is typically two standard deviations above the average Tmin at all levels of κ.

The same set of spike trains was then analyzed by the gravity method. The resulting gravity Tmin values were almost identical with those obtained for cross-correlation methods, despite the extremely disparate methods for determining significant correlation. No significant differences were found in a two-tailed Wilcoxon test at any level of κ (p > 0.2), suggesting that gravity and cross-correlation are equally sensitive to synchronous assemblies in the temporal realm. Again, the 1/κ² trend was highly significant (F(1, 95) = 1099.5, p ≪ 0.0001). It is of note, however, that the variance was less pronounced in the gravity analysis than with cross-correlation. In addition, the 95% limit occurred at somewhat less than two standard deviations above the mean Tmin values. In practice, this means that gravity interparticle distance measurements required shorter recordings to detect 95% of significant assembly neuron correlations (Table 1, column 8). Optimal tuning of the gravity method (not usually possible) and the cumulative average ISI normalization procedure for Δqi (which was somewhat sensitive to the change in unit firing rates) slightly weaken the result that the gravity method is more sensitive to synchronous assemblies. A reasonable conclusion, therefore, is that the gravity method is no less sensitive to synchronous assemblies than cross-correlation.

3.3 Multiple Assemblies. Assuming that some kind of intermediate-level structures (cell assemblies) exist in the functioning nervous system, are there any consequences of more than one assembly being active at the same time? Positron emission tomography and functional magnetic
Figure 3: Ten of the 20 unique spatial arrangements of three cell assemblies. Examples 9 and 10 show two cases where the cells of one assembly (assembly A) must be spatially discontiguous to achieve the given geometric arrangements (i.e., assembly A must have at least two members).
resonance imaging scans have suggested that multiple regions of the cortex are active during overt or imagined behavior (e.g., Roland 1982; Grafton et al. 1991; Rao et al. 1993; Constable et al. 1993). If multiple cortical regions can be active at the same time, it is also possible that multiple assemblies are active at the same time in the same area of cortex (Vaadia et al. 1989). Indeed, many theories of brain function assume precisely this type of representation (e.g., Hebb 1949; Hinton et al. 1986), and in fact any distributed code used by the brain would be extremely inefficient if this were not the case. Thus, distributed neural coding and multiple assemblies together could result in individual neurons being simultaneously activated in association with two or more assemblies. The complications introduced by this possibility are substantial. If two assemblies are simultaneously active in the same general brain area, there are three distinct spatial arrangements: (1) the two assemblies are nonoverlapping (that is, no neuron participates in the activity of both assemblies), (2) the assemblies partially overlap, or (3) the neurons of one assembly form a subset of the neurons of the second assembly. Disambiguation of the two assemblies thus involves classifying four types of neurons: those not involved in either assembly, those involved in assembly A only, those involved in assembly B only, and those associated with both assemblies. If more than two assemblies are simultaneously active, it can be shown that the number of possible spatial arrangements increases faster than 2^(2^(n−1)), where n is the number of simultaneously active assemblies (Appendix B). In the case of three simultaneously active assemblies, there are at least 20 distinct geometrical arrangements to be differentiated (see Fig. 3); four assemblies can have a minimum of 266 distinct arrangements.
Assemblies exist in time as well as in space, and thus they can also differ in the temporal sequencing of their activity. Even if an assembly can activate at only three times (before, after, or at roughly the same time as another assembly), there are at least

n! + Σ_{i=1}^{n−1} n!(n − i)! / [(n − i − 1)!(i + 1)!]

permutations of assembly onset times. Again n is the number of simultaneously active assemblies. The same number of possibilities would exist for assembly offset times. Thus, for two assemblies (n = 2), there are three possible geometrical arrangements, and 3 × 3 possible temporal (on × off) arrangements. This results in some 27 ways in which the assemblies can interact in the spatiotemporal domain. Three assemblies have 3380 interaction possibilities, while four assemblies can interact in more than 1,266,426 spatiotemporally unique ways. A more finely divided temporal spectrum rapidly expands these possibilities. Thus, multiple active assemblies are capable of vast numbers of interactions; the interaction counts above can be reproduced with the short sketch below. It remains to be determined, however, whether this can cause problems in practice.
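Combining the onset-time formula with the minimum numbers of geometric arrangements (3, 20, and 266 for n = 2, 3, 4; see Appendix B) reproduces the totals quoted above (a short Python check):

    from math import factorial

    def onset_orders(n):
        """Permutations of n assembly onset times (before/after/simultaneous)."""
        total = factorial(n)
        for i in range(1, n):
            total += (factorial(n) * factorial(n - i)
                      // (factorial(n - i - 1) * factorial(i + 1)))
        return total

    geometric = {2: 3, 3: 20, 4: 266}        # minimum spatial arrangements
    for n in (2, 3, 4):
        print(n, geometric[n] * onset_orders(n) ** 2)
    # -> 27, 3380, and 1266426 spatiotemporal interaction possibilities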
Figure 4: A comparison of cross-correlation and gravity methods when detecting two asynchronously activated cell assemblies. Circles represent the assemblies, with schematic electrodes indicating the region from which a unit is sampled. Assembly A (solid circle) begins firing 700 msec before assembly B (dashed circle). For cross-correlation (second row) Δ = 2 msec, and for the gravity analysis (third row) σg = 8.0 × 10⁻⁵ and τ = 2 msec. See text.
To test this, cross-correlation and gravity methods were used to separate two partially overlapping (simulated) synchronous assemblies. In one analysis, the two simulated assemblies activated 700 msec apart, while in the second case both assemblies activated simultaneously. Figure 4 shows the analysis results when assembly A began firing at 1000 msec and assembly B activated at 1700 msec. Both assemblies continued to fire until the end of the simulated data (4000 msec). Cross-correlating a pair of simulated neurons from two different assemblies—using the first 3000 msec of the spike train—shows no significant peak (leftmost column). Cross-correlating pairs of neurons from the same assembly shows intermediate-level peaks (around 40 coincident spikes; middle four columns). Cross-correlating a pair of neurons participating in both assemblies—firing to the proportion κ of the spikes from assembly A plus a similar proportion from assembly B—resulted in a substantially larger peak (approximately 70 coincident spikes; rightmost column). Hence, assuming assembly effects are more or less additive, one can classify all four types of neurons: neuron pairs with the highest peaks belong to the intersection of the two assemblies; medium-height peaks indicate neuron pairs from the same group; and flat correlograms indicate that the two neurons are from different assemblies or that at least one neuron is a member of neither assembly. Given a method for determining relative peak-height significance (i.e., whether two significant peaks are different from one another) and careful bookkeeping, one can completely categorize all simulated neurons. Determining the relative temporal onsets of the two assemblies would require scrolling correlograms.

The gravity interparticle (Euclidean) distance plots for the same set of simulated spike trains are shown in the third row of Figure 4. Three neuron pairs are plotted in each graph to give a rough characterization of the variance. The first observation is that assembly A–assembly B neuron pairs aggregate substantially (Fig. 4, column 1), so there is no trivial separation of units from different assemblies. This occurs because units of the two assemblies are firing at approximately 70 Hz, and therefore a unit from A fires, on average, within 7.1 msec of any unit in B. Not shown is the fact that pairs of completely independent neurons (firing at 4–10 Hz) remain 100 units apart, while nonassembly–assembly pairs aggregate to approximately 75 units of separation by 4000 msec. This serves to separate pairs of units where at least one has no assembly affiliations. Pairs where at least one neuron is involved with assembly B alone (columns 3 and 5) start aggregating at 1700 msec (when assembly B activates). All other pairs begin aggregating at 1000 msec. By considering the onset times of aggregation and calculating every possible pairwise distance, it is difficult but possible to separate assembly B units as well as units involved in both assemblies ("AB units"). The remaining units are then classified as assembly A units. Finally, notice the aggregation artifact in column 3 at 840 msec. This occurred because the two units of that pair fired within 1 msec of one another and, prior to that event, had not fired for 630 and 372 msec, respectively. Thus both gained a large charge when they coincidentally fired in near synchrony. Such artifacts are partly a consequence of the rate normalization, whereby neurons that fire infrequently receive large charge increments.
Figure 5: A comparison of cross-correlation and gravity methods when detecting two synchronously activated cell assemblies. Circles represent the assemblies, with schematic electrodes indicating the region from which a unit is sampled. Both assemblies begin firing at 1000 msec. Again, for cross-correlation Δ = 2 msec and for gravity σg = 8.0 × 10⁻⁵ and τ = 2 msec. See text.
Such artifacts can be reduced, but not eliminated, by a smaller value for σg (slower aggregation) or a smaller τ (a narrower synchrony window).

Figure 5 shows the exact same analysis as in Figure 4 but performed on spike trains wherein assemblies A and B activate simultaneously at 1000 msec. Aside from the small (but significant) central cross-correlation peaks in columns 2 and 3, the results of this analysis are identical to those in Figure 4. This demonstrates cross-correlation's ability to separate the two assemblies and highlights its insensitivity to assembly onset times. Again, scrolling cross-correlograms could be used, but this would be cumbersome for 80 recorded neurons (3160 cross-correlograms per time step). The gravity analysis was less successful in this situation. First, there is considerable cross-assembly aggregation again (see Fig. 5, column 1), which prevents trivial separation of the two types of assembly units. Second, the analysis was considerably more affected by aggregation artifacts. When the two assemblies first activate, particles in both assemblies are near-synchronously charged, and these charges are large due to large prior interspike intervals. The result is sudden and dramatic particle aggregation among some assembly neurons (e.g., Fig. 5, columns 3 and 4). Again, one could decrease σg or τ, but this slows the aggregation process, making more recording necessary. At the parameter values used (σg = 8.0 × 10⁻⁵, τ = 2 msec), the two assemblies were not quite differentiable. Unlike the situation in Figure 4, here the two assemblies activated at the same time, and thus separations
based on the initiation time of aggregation are impossible. A smaller value for τ reduces this A–B aggregation but makes the method less sensitive to near-synchrony. If the assembly firing rates are reduced considerably (e.g., to 10–20 Hz, or κ = 0.06), the A–B pairs aggregate much less, and thus the assembly units can be classified into their respective assemblies by computing all pairwise distances. Such a rate reduction, however, was unnecessary for cross-correlation.

4 Discussion

In summary, detecting assembly activity in simultaneous neural recordings was more difficult than expected. First, cross-correlation and the gravity method may require several hundred milliseconds of assembly data in order to reveal significant correlations, even for relatively consistent assembly participation. Highly inconsistent participation may require neural recordings in excess of several seconds. Unfortunately, consistency is not a directly measurable quantity. Second, although cross-correlation and gravity were both able to differentiate units from two simultaneously active synchronous assemblies, separating three or more assemblies may well be impossible for cross-correlation. The gravity method appears somewhat more flexible in this regard. These points, as well as some of their consequences, will now be considered in more detail.

4.1 Temporal Sensitivity. The formal analysis of cross-correlated spike trains from synchronous assemblies (see Appendix A) showed that the minimum recording time to reach significance (Tmin) is strongly dependent on κ, the consistency with which a neuron participated in the assembly. Tests of cross-correlation on simulated data supported the theoretical predictions. No formal results were obtained for the minimum recording times required for the gravity method, though the near-identical empirical Tmin values suggest that gravity analyses will behave similarly to cross-correlation and thus be dependent on κ, not on assembly firing rates. Using equation 3.1, and thereby assuming that assembly firing rates are significantly elevated with respect to background rates, it is theoretically possible to predict the minimum recording time required to detect a synchronous assembly in a planned experiment. Prediction relies on an estimate of κ, however, which may be unavailable. In such a case, equation 3.1 can be solved for κ, and the recording time from a significant correlogram peak can be inserted for Tmin. This will give a value for κmin, which reflects the minimum consistency with which neurons must be participating in an assembly. This value, then, can be used as a reference when making further estimates of Tmin.
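Inverting equation 3.1 gives κmin = sqrt(4Δ/Tmin); a minimal sketch (Python, with an illustrative recording time):

    from math import sqrt

    def kappa_min(t_min_ms, delta_ms=2.0):
        """Minimum consistency implied by a peak first significant at t_min_ms."""
        return sqrt(4 * delta_ms / t_min_ms)

    print(kappa_min(400.0))   # -> 0.141...; units must participate with kappa >= ~0.14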
It is noteworthy that despite perfectly synchronous assembly data and finely tuned analysis parameters to capture only those neurons participating in the assembly, the minimum recording time to detect 95% of the assembly neuron pairs could be quite long. A minimum of 1200 msec of recording was necessary when κ = 0.2. There are two major consequences of long recording times. First, such long periods require either that the assembly remain active for the entire period or that multiple trials be run (so that the total assembly-active recording time equals the derived minimum recording time). Keeping the assembly operating is usually beyond experimental control. On the other hand, summing across multiple trials has its own, familiar drawbacks. Specifically, it assumes that each assembly is activated on essentially every trial and that assembly activity is stationary across time. Neither assumption is reasonable for experimental paradigms involving learning, and the first assumption may never be reasonable. The alternative to assuming that assemblies are consistently activated on successive trials is to select the data to be analyzed for assembly phenomena using some objective criteria. This approach can be successful, but it risks possible circularity by predicting effects in data that were selected for precisely those effects.

The second major consequence of long minimum recording times is that they make detection of certain types of assemblies impossible using these methods. A useful illustration is a wave of activity spreading across the cortex, such as that generated by a cortical seizure. Since this is a traveling wave, individual neurons participate in this event only briefly. From the point of view of a stationary microelectrode, perhaps only three or four spikes will be recorded as the wave of activity passes by. This may not be enough data to detect that a wave has passed at all. Traveling waves are usually considered pathological (Petsche and Sterc 1968), but it is certainly possible that analogous waves could form a functional part of the brain. Blum (1967), Walter (1968), Ermentrout and Cowan (1979), and others all suggested this possibility, and there is electroencephalogram evidence for traveling waves in the human cortex under normal conditions (Hughes 1995). This is not the only type of synchronous assembly that is undetectable by the correlation methods described here. For example, neurons activated in association with one (synchronous) assembly and then with another would also be only transiently correlated with any given group of cells. Both examples point to fundamental difficulties in investigating certain types of assembly behavior with these techniques. Other statistics, new or old, can avoid this limitation if they are able to exploit different parameters of assembly activity (e.g., Abeles and Gerstein 1988; Abeles et al. 1993, 1994).

4.2 Multiple Assemblies. As the number of active assemblies within a cortical area increases linearly, the number of possible spatiotemporal arrangements of those assemblies increases much faster than factorially. This huge number of possible spatiotemporal interactions is quite advantageous from the point of view of behavioral flexibility. At the same time, it is a serious problem for the neurophysiologist who is attempting to pin down the
spatiotemporal interaction actually in use. Cross-correlation can separate two overlapping synchronous assemblies, but this depends on a method for determining whether two independently significant correlogram peaks are significantly different from one another. More generally, when trying to separate neurons of n simultaneously active assemblies, n different correlogram peak heights will have to be differentiable. Currently no such method exists. Furthermore, in the typical experimental situation, the precise number of assemblies recorded from (i.e., the value of n) is not known in advance. In practice, these last two constraints make it unlikely that cross-correlation can differentiate three or more simultaneously active assemblies. Perhaps more problematic, cross-correlation scales very poorly. For N neurons, N(N − 1)/2 cross-correlograms must be calculated and compared. To obtain even minimal temporal resolution, cross-correlograms must be calculated using a scrolling time window. In this case, N(N − 1)/2 cross-correlograms must be calculated for each time step. It is apparent that cross-correlation quickly becomes untenable; for 80 neurons and 10 time steps (a resolution of 400 msec when using 4 sec of data), some 31,600 cross-correlograms must be evaluated and compared. The time problem, but not the large number of neuron pairs, can be avoided by using the joint peri-stimulus time histogram (JPSTH; Aertsen et al. 1989), which is closely related to cross-correlation. Using the JPSTH method requires only 3160 figures, but comparison of these figures becomes considerably more complicated.

The gravity method is somewhat more flexible than cross-correlation for multiple-assembly analysis. The nature of the method lends itself to determining relative assembly onset times. Furthermore, it is not hampered as cross-correlation is with regard to knowing the value of n and can indeed be applied regardless of the number of active assemblies. The simultaneous-onset, high-firing-rate assemblies represent a fairly extreme situation, so the difficulties of the gravity method there are of only limited consequence. In theory, the gravity method is limited by one's ability to analyze particle clusters in an N-dimensional space. Techniques for such analyses are plentiful. In practice, limitations on the gravity method center on the aggregation parameters and the firing-rate normalization procedure. As mentioned, large clustering artifacts can be minimized (but not eliminated) by shrinking σg or τ. This approach, however, makes the method less sensitive overall and thus requires more recording data. That σg is free to vary can also make it difficult to determine which clusters are significant. If rate normalization uses an average value for a neuron's firing rate (instead of the instantaneous rate), artifacts are spread over time but not eliminated. Large increases in neuronal firing rates and sudden changes from asynchronous to synchronous firing (both of which appeared here) are most likely to cause such artifacts, though artifacts can occur spontaneously (see Fig. 4, column 3).
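The 31,600 figure above is simple arithmetic (Python):

    n_neurons, n_steps = 80, 10
    pairs = n_neurons * (n_neurons - 1) // 2   # 3160 correlograms per time step
    print(pairs, pairs * n_steps)              # -> 3160 31600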
4.3 Other Assembly Types. The results here addressed only synchronous cell assemblies. However, there are many other plausible types of neuronal assemblies. Network-level lateral inhibition, attractor networks, and cellular automata are just a few from a wide array of alternative assembly definitions. Some assembly types may require the development of entirely new statistical tests or the application of existing techniques to a new domain. Ideally one wants to find and use the most sensitive statistic for uncovering such assemblies. Regrettably, there is no general solution to this problem. Instead, each candidate statistic must be evaluated with respect to its ability to detect each hypothesized assembly type. Techniques like principal components analysis (Nicolelis and Chapin 1994), firing space descriptions (Gerstein and Gochin 1992; Gochin et al. 1994), and JPSTH (Aertsen et al. 1989) are steps in this direction, but such techniques remain to be examined for their sensitivity to assembly phenomena. There is one final aspect of cell assembly phenomena that should be considered. It is possible that various types of functional cell assemblies are implemented simultaneously. For example, the cortex may subserve lateral inhibition networks and synchronous assemblies, with one type of assembly overlapping the other. This type of interaction is perhaps currently beyond the realm of parsimony but may need to be considered in the future. Critically, one would have to consider (1) what type(s) of assembly are present and (2) how these assembly types might interact and interfere with one another at the cellular level. Such issues are bound to become crucial as knowledge about assembly phenomena in the brain accumulates.
4.4 Conclusions. So how do we experimentally test the cell assembly hypotheses of Hebb, Abeles, and others? It would seem that four steps are necessary: (1) select an experimental paradigm to activate the hypothesized assemblies, (2) record from a sufficient number of cells in the investigated region (Strangman 1996), (3) select the statistical technique that is best able to detect the hypothesized type of assembly, and (4) record sufficient amounts of data to detect any existing assembly activity. This paper bears on the last two requirements. The results suggest that cross-correlation and the gravity method have approximately equal temporal sensitivities with regard to synchronous assemblies. The equations here can be used to help estimate the amount of data that needs to be collected. Furthermore, when dealing with multiple, overlapping assemblies, it appears that the gravity method is more flexible than cross-correlation. Ideally, one desires methods that take full advantage of the parallelism in such data. Knowing that the gravity method approaches this ideal more closely than cross-correlation may help bridge the gap between assembly-level theories of brain function and the simultaneous electrode recording data now becoming available.
Figure 6: Theoretical cross-correlation histogram for synchronous cell assemblies (adapted from Aertsen and Gerstein 1985). Here, d is the height (in spikes) of the cross-correlogram peak, σ is the width of this peak (in msec), e0 is the average correlation level for two independent neurons, e is the average correlation level when a synchronous assembly is active, and s is the noise in correlation level e.
Appendix A

The formal analysis presented here for cross-correlation closely follows the analysis of effective connectivity in cross-correlograms as given by Aertsen and Gerstein (1985). Refer to Figure 6 for notation as well as for the theoretical cross-correlation curve of two neurons active as part of the same synchronous assembly. In the absence of an active assembly, the two neurons being recorded will have an average base-level cross-correlation, e0, such that, ignoring statistical variation, e0 = ρ1 ρ2 T Δ, where ρ1 and ρ2 are the neuron background firing rates, T is the length of recording, and Δ is the cross-correlogram bin width (see Table 2 for a summary of all variables). If one neuron to be correlated is part of the active synchronous assembly and the other is a neuron firing at its background rate, the average correlation level, e, is e = ρ′1 ρ2 T Δ, where the prime indicates the (higher) firing rate of an assembly neuron when the assembly is operational. Similarly, the average correlation level for two assembly neurons is

e = ρ′1 ρ′2 T Δ = (ρ1 + κρ0)(ρ2 + κρ0) T Δ,   (A.1)

where ρ0 is the firing rate of the assembly as a whole. Assume, then, that all spikes defined as synchronous (i.e., firing within ±ε msec of one another)
Table 2: Variables.

    Variable   Definition
    d          Height of a cross-correlogram peak (in spikes)
    e0         Average correlation level for two independent neurons (in spikes)
    e          Average correlation level during cell assembly activity (in spikes)
    s          Noise associated with the correlation level e (in spikes)
    T          Length of spike train recording (in msec)
    Tmin       Minimum length of spike train recording (in msec) necessary to
               detect a synchronous assembly
    κ          Consistency parameter: probability that a unit fires when the
               assembly source unit discharges
    ρi         Average firing rate of nonassembly units or neurons (in spikes/sec);
               subscript 0 indicates average firing rate of the assembly source
               spike train
    ρ′i        Firing rate of an assembly unit or neuron, = ρi + κρ0 (in spikes/sec)
    ε          The synchrony jitter parameter (in msec)
    σ          Width of synchrony window from assembly firing (= 2ε, in msec)
    Δ          Cross-correlation histogram bin width (in msec)
    σg         The gravity aggregation parameter (Gerstein et al. 1985 uses σ)
    τ          The gravity charge decay parameter (in msec)
    Δqi        The gravitational charge increment at the time of a spike in neuron i
are contained in a single correlogram bin (so 2ε = σ ≤ Δ). For example, if ε = 2 msec (e.g., Abeles 1982, 1991), one would set Δ ≥ 4 msec. In this case d, the height of the synchronous correlogram peak, is given by

d = κ² ρ0 T,   if σ ≤ Δ;   otherwise,   d = κ² ρ0 T (Δ/σ).   (A.2)
Using a standard signal detection criterion,

|d| ≥ 2s,   (A.3)
where s is the magnitude of the noise (s = √e for a Poisson process). Taking equation A.3 and using e from equation A.1 and the first expression for d in equation A.2, one can solve for the recording time T, the time required to
detect significant correlation:

T ≥ 4Δ ρ′1 ρ′2 / (κ⁴ ρ0²) = 4Δ (ρ1 + κρ0)(ρ2 + κρ0) / (κ⁴ ρ0²)   when σ ≤ Δ.   (A.4)
This requires an estimate of the average firing rate of the assembly as a whole, ρ0. Assuming the base firing rates, ρ1 and ρ2, are small relative to κρ0, they drop out and T can be expressed as

Tmin = 4Δ/κ²   (or Tmin = 4σ²/(κ²Δ), if σ > Δ),   (A.5)
where greater-than-or-equal-to has been replaced with equals, and thus T is changed to the minimum required recording time to detect significant correlation, Tmin. So assuming that ρ1, ρ2 ≪ κρ0, it follows that Tmin is independent of the assembly firing rate. This follows from equations A.1 through A.3: the ratio of d to 2√e (the assembly detectability) is essentially unchanged by variations in ρ0, and this is precisely true when ρ1 = ρ2 = 0. (See Gerstein and Aertsen 1985 for related calculations regarding the gravity method.)
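This rate independence is easy to check numerically from equation A.4 (a Python sketch; the rates and parameter values are arbitrary illustrative choices):

    def t_required(rho0, kappa=0.4, delta=0.002, base=(5.0, 5.0)):
        """Recording time (sec) from equation A.4, for sigma <= delta."""
        numerator = 4 * delta * (base[0] + kappa * rho0) * (base[1] + kappa * rho0)
        return numerator / (kappa**4 * rho0**2)

    for rho0 in (50.0, 100.0, 200.0):
        print(round(1000 * t_required(rho0)), "msec")
    # -> 78, 63, 56 msec, approaching the rate-independent limit
    #    4*delta/kappa**2 = 50 msec as the base rates become negligible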
Appendix B

Counting unique assembly Venn diagrams depends on the constraints used to define "unique." The counting here was accomplished by a computer algorithm that evaluated vectors of binary numbers. The vectors were composed of 2ⁿ − 1 elements, where n is the number of assemblies and 2ⁿ − 1 is equal to the maximum possible number of nonempty assembly intersections. For three assemblies, the bits in the vector represented assembly intersections A, B, C, A∩B, A∩C, B∩C, and A∩B∩C, in this order (where A ∩ B̄ ∩ C̄ is written as A, and so on). Each "on" bit in a 2ⁿ − 1 element vector means the respective assembly intersection has more than zero members. Thus, 0100011 corresponds to the set {B, B∩C, A∩B∩C} and represents a nonempty assembly A being a proper subset of assembly C, which is in turn a proper subset of assembly B. All possible binary vectors of 2ⁿ − 1 elements were evaluated according to the most conservative Venn diagram counting procedure. This requires that four criteria be met. First, any arrangements differing only by how many cells (> 0), or which particular cells, appear in each region of the diagram are classified as the same arrangement. This is built into the vector representation above. Second, each assembly must exist in the final intersection group. For example, for three groups A, B, and C, the set {B, B∩C, A∩B∩C} is valid, while the set {A, B, A∩B} is not a valid arrangement for three assemblies (the latter implies that assembly C is empty, meaning that there are actually only two assemblies). Third, there must be at least n distinct intersections, where n is the number of active assemblies (i.e., {A∩B∩C} is not valid, because it represents three coincident assemblies). Thus, there must be at least n "on" bits in the vector. Fourth, one assembly must have neurons that are part of only that assembly (thus avoiding degenerate cases of two completely coincident assemblies). That is, at least one of the leftmost n bits in the vector must be on. Criteria 3 and 4 are of questionable value, and eliminating one or both slightly increases the number of unique geometric arrangements beyond the numbers reported here.

Acknowledgments

This article is based on work supported by a National Science Foundation Graduate Fellowship. I am indebted to J. A. Anderson for his insight into the importance of this problem. I also thank J. P. Donoghue, D. Ascher, and B. W. Connors for their helpful suggestions regarding earlier versions of this paper.

References

Abeles, M. 1982. Local Cortical Circuits: An Electrophysiological Study. Springer-Verlag, Berlin.
Abeles, M. 1991. Corticonics. Cambridge University Press, Cambridge.
Abeles, M., and Gerstein, G. L. 1988. Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol. 60, 909–924.
Abeles, M., Bergman, H., Margalit, E., and Vaadia, E. 1993. Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol. 70(4), 1629–1638.
Abeles, M., Prut, Y., Bergman, H., and Vaadia, E. 1994. Synchronization in neuronal transmission and its importance for information processing. Prog. Brain Res. 102, 395–404.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., and Vaadia, E. 1995. Cortical activity flips among quasi-stationary states. Proc. Nat. Acad. Sci. 92, 8616–8620.
Aertsen, A. M. H. J., and Gerstein, G. L. 1985. Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Res. 340, 341–354.
Aertsen, A. M. H. J., and Gerstein, G. L. 1991. Dynamic aspects of neuronal cooperativity: Fast stimulus-locked modulations of effective connectivity. In Neuronal Cooperativity, J. Krüger, ed. Springer-Verlag, Berlin.
Aertsen, A. M. H. J., Bonhoeffer, T., and Krüger, J. 1987. Coherent activity in neuronal populations: Analysis and interpretation. In Physics of Cognitive Processes, E. R. Caianiello, ed. World Scientific, Singapore.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., and Palm, G. 1989. Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol. 61, 900–917.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. 1977. Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psych. Rev. 84, 413–451.
Barlow, H. B. 1972. Single units and sensation: A neuron doctrine for perceptual psychology? Perception 1, 371–394. Barlow, H. B. 1992. The biological role of neocortex. In Information Processing in the Cortex, A. Aertsen and V. Braitenberg, eds., pp. 53–80. Springer-Verlag, Berlin. Bernard, C., Axelrad, H., and Giraud, B. G. 1993. Effects of collateral inhibition in a model of the immature rat cerebellar cortex: Multineuron correlations. Cog. Brain Res. 1, 100–122. Blum, H. 1967. A new model of global brain function. Perspect. Biol. Med. 10, 381–408. Caton, R. 1875. The electric currents of the brain. British Med. J. 2, 278. Constable, R. T., McCarthy, G., Allison, T., Anderson, A. W., and Gore, J. C. 1993. Functional brain imaging at 1.5T using conventional gradient echo MR imaging techniques. Magn. Reson. Imaging 11, 451–459. Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121–130. Edelman, G. M. 1987. Neural Darwinism: The Theory of Neuronal Group Selection. Basic Books, New York. Ermentrout, G. B., and Cowan, J. D. 1979. Temporal oscillations in neuronal nets. J. Math. Biol. 7, 265–280. Falk, C. X., Wu, J. Y., Cohen, L. B., and Tang, A. C. 1993. Nonuniform expression of habituation in the activity of distinct classes of neurons in the Aplysia abdominal ganglion. J. Neurosci. 13(9), 4072–4081. Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416–1419. Gerstein, G. L., and Aertsen, A. M. H. J. 1985. Representation of cooperative firing activity among simultaneously recorded neurons. J. Neurophysiol. 54, 1513–1528. Gerstein, G. L., Bedenbaugh, P., and Aertsen, A. M. H. J. 1989. Neuronal assemblies. IEEE Trans. Biomed. Eng. 36, 4–14. Gerstein, G. L., and Gochin, P. M. 1992. Neuronal population coding and the elephant. In Information Processing in the Cortex, A. Aertsen and V. Braitenberg, eds., pp. 139–160. Springer-Verlag, Berlin. Gerstein, G. L., and Perkel, D. H. 1969. Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science 164, 828–830. Gerstein, G. L., and Perkel, D. H. 1972. Mutual temporal relationships among neuronal spike trains. Biophys. J. 12, 453–473. Gerstein, G. L., Perkel, D. H., and Subramanian, K. H. 1978. Identification of functionally related neural assemblies. Brain Res. 140, 43–62. Gerstein, G. L., Perkel, D. H., and Dayhoff, J. E. 1985. Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. J. Neurosci. 5, 881–889. Gochin, P. M., Kaltenbach, J. A., and Gerstein, G. L. 1989. Coordinated activity of neuron pairs in anesthetized rat dorsal cochlear nucleus. Brain Res. 497, 1–11.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., and Gross, C. G. 1994. Neural ensemble coding in inferior temporal cortex. J. Neurophysiol. 71, 2325–2337.
Grafton, S. T., Woods, R. P., Mazziotta, J. C., and Phelps, M. E. 1991. Somatotopic mapping of the primary motor cortex in humans: Activation studies with cerebral blood flow and positron emission tomography. J. Neurophysiol. 66, 735–743.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature 338, 334–337.
Gross, G. W., Rieske, E., Kreutzberg, G. W., and Meyer, A. 1977. A new fixed-array multi-microelectrode system designed for long-term monitoring of extracellular single unit neuronal activity in vitro. Neurosci. Lett. 6, 101–105.
Hebb, D. O. 1949. The Organization of Behavior. John Wiley, New York.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., vol. 1, pp. 77–109. MIT Press, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. 79, 2554–2558.
Hughes, J. R. 1995. The phenomenon of travelling waves: A review. Clin. Electroencephalogr. 26, 1–6.
James, W. 1892. Psychology (Briefer Course). Holt, New York.
Knuth, D. E. 1981. Seminumerical Algorithms. Addison-Wesley, Reading, MA.
Kubie, L. S. 1930. A theoretical application to some neurological problems of the properties of excitation waves which move in closed circuits. Brain 53, 166–177.
McNaughton, B. L., O'Keefe, J., and Barnes, C. A. 1983. The stereotrode: A new technique for simultaneous isolation of several single units in the central nervous system from multiple unit records. J. Neurosci. Methods 8, 391–397.
Moore, G. P., Perkel, D. H., and Segundo, J. P. 1966. Statistical analysis and functional interpretation of neuronal spike data. Ann. Rev. Physiol. 28, 493–522.
Nicolelis, M. A., and Chapin, J. K. 1994. Spatiotemporal structure of somatosensory responses of many-neuron ensembles in the rat ventral posterior medial nucleus of the thalamus. J. Neurosci. 14(6), 3511–3532.
Nicolelis, M. A., Lin, R. C. S., Woodward, D. J., and Chapin, J. K. 1993. Dynamic distributed properties of many-neuron ensembles in the ventral posterior medial thalamus of awake rats. Proc. Natl. Acad. Sci. USA 90, 2212–2216.
Nordhausen, C. T., Rousche, P. J., and Normann, R. A. 1994. Optimizing recording capabilities of the Utah Intracortical Electrode Array. Brain Res. 637, 27–36.
Palm, G. 1982. Neural Assemblies: An Alternative Approach to Artificial Intelligence. Springer-Verlag, Berlin.
Perkel, D. H., Gerstein, G. L., and Moore, G. P. 1967. Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J. 7, 419–440.
Perkel, D. H., Gerstein, G. L., Smith, M. S., and Tatton, W. G. 1975. Nerve impulse patterns: A quantitative display technique for three neurons. Brain Res. 100, 271–296.
Petsche, H., and Sterc, J. 1968. The significance of the cortex for the travelling phenomenon of brain waves. Electroenceph. Clin. Neurophysiol. 25, 11–22.
Pickard, R. S., and Welberry, J. R. 1976. Printed circuit microelectrodes and their application to honeybee brain. J. Exp. Biol. 64, 39–44.
Rao, S. M., Binder, J. R., Bandettini, P. A., Hammeke, T. A., Yetkin, F. Z., Jesmanowicz, A., Lisk, L. M., Morris, G. L., Mueller, W. M., Estkowski, L. D., Wong, E. C., Haughton, V. M., and Hyde, J. S. 1993. Functional magnetic resonance imaging of complex human movements. Neurology 43, 2311–2318.
Roland, P. E. 1982. Cortical regulation of selective attention in man: A regional cerebral blood flow study. J. Neurophysiol. 48, 1059–1078.
Strangman, G. 1996. Searching for cell assemblies: How many electrodes do I need? J. Comput. Neurosci. 3, 111–124.
Vaadia, E., Bergman, H., and Abeles, M. 1989. Neuronal activities related to higher brain functions—theoretical and experimental implications. IEEE Trans. Biomed. Eng. 36, 25–35.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., and Aertsen, A. 1995. Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature 373, 515–518.
Walter, W. G. 1968. Electric signs of expectancy and decision in the human brain. In Cybernetic Problems in Bionics, H. L. Oestreicher and D. R. Moore, eds., pp. 361–396. Gordon and Breach Science Publishers, New York.
Wilson, M. A., and McNaughton, B. L. 1993. Dynamics of the hippocampal ensemble code for space. Science 261, 1055–1057.
Yamada, S., Nakashima, M., Matsumoto, K., and Shiono, S. 1993. Information theoretic analysis of action potential trains. Biol. Cybern. 68, 215–220.
Received June 26, 1995; accepted April 11, 1996.
Communicated by Bard Ermentrout
A Simple Neural Network Exhibiting Selective Activation of Neuronal Ensembles: From Winner-Take-All to Winners-Share-All Tomoki Fukai Department of Electronics, Tokai University, Kitakaname 1117, Hiratsuka, Kanagawa, Japan Laboratory for Neural Modeling, Frontier Research Program, RIKEN (Institute of Physical and Chemical Research), 2-1 Hirosawa, Wako, Saitama 351-01, Japan
Shigeru Tanaka Laboratory for Neural Modeling, Frontier Research Program, RIKEN (Institute of Physical and Chemical Research), 2-1 Hirosawa, Wako, Saitama 351-01, Japan
A neuroecological equation of the Lotka-Volterra type for mean firing rate is derived from the conventional membrane dynamics of a neural network with lateral inhibition and self-inhibition. Neural selection mechanisms employed by the competitive neural network receiving external inputs are studied with analytic and numerical calculations. A remarkable finding is that the strength of lateral inhibition relative to that of self-inhibition is crucial for determining the steady states of the network among three qualitatively different types of behavior. Equal strength of both types of inhibitory connections leads the network to the well-known winnertake-all behavior. If, however, the lateral inhibition is weaker than the self-inhibition, a certain number of neurons are activated in the steady states or the number of winners is in general more than one (the winnersshare-all behavior). On the other hand, if the self-inhibition is weaker than the lateral one, only one neuron is activated, but the winner is not necessarily the neuron receiving the largest input. It is suggested that our simple network model provides a mathematical basis for understanding neural selection mechanisms. 1 Introduction It is believed that information processing is performed by many neurons in a highly parallel distributed manner. For example, about half of the neocortex was shown to be involved in visual information processing in the macaque monkey (Van Essen et al. 1984). This implies that coactivation of a large population of neurons is a fundamental strategy employed by the Neural Computation 9, 77–97 (1997)
c 1997 Massachusetts Institute of Technology °
78
Tomoki Fukai and Shigeru Tanaka
brain in sensory information processing and motor control. In fact, there is evidence that the neuronal population coding is adopted in various regions of the brain, including the primary visual cortex (Gilbert and Wiesel 1990), the superior colliculus for saccadic eye movements (Wurtz et al. 1980; Van Opstal and Van Gisbergen 1989), the inferotemporal cortex for human face recognition of the monkey (Young and Yamane 1992), the motor cortex decoding arm movements of the monkey (Georgopoulos et al. 1993), and the CA1 region of rat hippocampus encoding place information (Wilson and McNaughton 1993). The question is how a particular population of neurons is selected for engagement in an information processing task. Furthermore, a similar neural selection seems to provide a functional basis for active information processing of the brain such as the planning and execution of voluntary movement. It seems that voluntary movement is initiated by a selective transformation of neural activities coded in the several motor cortices into temporal sequences of motor commands through the subcortical nuclei (Hikosaka 1991; Tamori and Tanaka 1993). Actually, nerve projections from basal ganglia, which receive inputs from the motor cortices, converge to parts of the thalamus (Alexander et al. 1986). This implies that the number of active neurons involved in voluntary movement is reduced in the course of generating temporal sequences of neural activities related to motor actions from spatially distributed information in the motor cortices. In other words, it is likely that the subcortical nuclei are involved in the selection of neural activities. Because similar circuits are observed in almost the entire cerebral cortex besides the motor cortices (Alexander et al. 1986), such neural selection might be common in the processing of temporal information as in language generation and logical thinking. In this sense, it is of importance to understand mechanisms of neural selection in order to build computational models for cognitive functions as well as motor functions. A rule for selection requires some competition in general. In order to model the neural competition that might be induced in the brain, we propose a simple model describing the competition of neural activities in terms of excitatory inputs and all-to-all recurrent inhibitory connections including self-inhibition. Competitive behavior of neural systems has been repeatedly reported in the literature (Grossberg 1973; Cohen and Grossberg 1983; Yuille and Grzywacz 1989; Coultrip et al. 1992; Ermentrout 1992; Kaski and Kohonen 1994). In this paper, particular attention is paid to qualitative and quantitative changes in the selection of winners as the strength of the lateral inhibition relative to that of the self-inhibition is varied. When the ratio of strength of the lateral inhibition to that of the self-inhibition is larger than unity, only one neuron is selected to be active. On the other hand, when the ratio is smaller than unity, a certain number of neurons are selected to be active. We examine properties of such “winners-share-all” solutions to the dynamic equation of neural activities, which is derived from the conventional equations of the membrane dynamics and the sigmoidal output
A Simple Neural Network Exhibiting Selective Activation
79
function in order to make analytical studies possible. Our equation can be classified as one type of the Lotka-Volterra equations that have been used for the description of ecological systems. Thus, we may say that temporal information is generated through selection by the ecological system of neural activities. In the literature, the K-winner-take-all behavior is known as the solution that admits K (≥ 1) winners for competition in inhibitory neural networks similar to the present one (Majani et al. 1989). However, in the K-winner networks, the solutions could be systematically studied only for homogeneous external inputs (i.e., when they are neuron independent), and their dynamic behavior with nonhomogeneous inputs is in general complicated (Wolfe et al. 1991). Hence particular attention was paid to the behavior in which winners were selected according to the initial conditions. On the other hand, we show that winners are always selected according to the magnitudes of external stimuli for arbitrary initial states when each neuron receives a sufficiently strong inhibitory feedback. Such competitive behavior implies that the neural network always evolves into an internal state well representing the structure of the external world. In this paper, we analyze the stable fixed-point solutions, which can be classified into the winners-share-all or winner-take-all type, to our dynamic equation. Simulated results are also shown for both types of solutions. Then, limiting ourselves to the winner-take-all solutions, we examine analytically the relaxation of one fixed point to another when the external inputs suddenly change. 2 Derivation of the Ecological Equation for Neurodynamics For the sake of later exact analysis of competitive network behavior, we derive a Lotka-Volterra equation for firing activity from the standard equation for the membrane dynamics. Such a Lotka-Volterra dynamics was first discussed by Cowan (1968, 1970). According to the conventional model for neural dynamics, the membrane potential and firing rate of the ith neuron are expressed as follows: dui = −λui + (input current), dt
(2.1)
zi = f (ui − θ), where f (ui − θ) represents a nonlinear output function. If we assume that this function is given by the logistic function f (x) = f0 [1 + exp(−βx)]−1 ,
(2.2)
80
Tomoki Fukai and Shigeru Tanaka
the dynamics of the firing rate can be expressed by µ · ¶ ¸ βzi ( f0 − zi ) f0 − zi λ dzi = −λθ + log + (input current) . (2.3) dt f0 β zi The input current is assumed to be given by afferent inputs and lateral connections: (input current) = γ + Wi −
X
Vii0 zi0 ,
(2.4)
i0
where γ represents the input current induced by the afferent inputs nonspecific to each cell, and Wi represents the input current generated by the specific afferent inputs. Vii0 is the strength of the lateral inhibitory connections. If the firing rate is always much smaller than its maximum value f0 , this equation can be simplified as " # ¶ µ X f0 − zi λ dzi = zi 1 + Wi − zi ( f0 − zi ) log , (2.5) Vii0 zi0 + dt β f0 (γ − λθ ) zi i0 by omitting the factor ( f0 − zi )/f0 in the first term without changing a qualitative feature of the dynamics. Here γ − λθ was assumed to be positive, and the time t, specific input current Wi , and lateral inhibition Vii0 were rescaled as β(γ − λθ)t → t, Wi /(γ − λθ ) → Wi and Vii0 /(γ − λθ ) → Vii0 , respectively. The second term, indicating that the nonlinear output function has two asymptotes, takes positive values for zi ¿ f0 , while it takes negative values for zi close to f0 . Note that the boundedness of the membrane potential in the original equation (2.1) is preserved owing to the presence of this term. Since our system has global inhibitory connections for which the divergence in activity never occurs, we do not need to retain the negativity of the term for zi close to f0 . Thus, for simplicity, the term can be replaced by a positive constant ε. As we will see later, this replacement does not change the qualitative behavior of the dynamics. Finally we obtain the following equation for neurodynamics: " # X dzi = zi 1 + Wi − Vii0 zi0 + ε, dt i0
(2.6)
which is a type of the Lotka-Volterra equation. In the following, we consider an N-neuron system that has self-inhibition as well as uniform lateral inhibition: ½ 1 i = i0 (2.7) Vii0 = k i 6= i0 ,
A Simple Neural Network Exhibiting Selective Activation
81
where k is a positive constant representing the relative strength of the lateral inhibitory connections to the self-inhibitory ones. The self-inhibition is introduced since biological neural networks often have local inhibitory interneurons that deliver feedback inhibition to the cells activating those interneurons (Shepherd 1990). The lateral inhibition is assumed to be uniform for the sake of mathematical simplicity. Thus equation 2.6 is written as X dzi = zi (1 − zi − k zj + Wi ) + ε, dt j6=i
i = 1, . . . , N
(2.8)
which yields the basic equation for the present analytic study of neurodynamics. 3 The Dynamic Properties and the Steady States of the Model The network model (see equation 2.8) with k = 1 was first derived for a selforganization model of primary visual cortex by one of the present authors (Tanaka 1990) to describe the competitive evolution of densities of synapses in local cortical regions. It is also possible to interpret the network model formally as a type of shunting inhibition model (Grossberg 1973; Cohen and Grossberg 1983). By using new variables xi = zi − µ with ε = (1 + Nk − k)µ2 , we can rewrite equation 2.8 as dxi = −(1 + Nk − k)µ(ε)xi + (1 + Wi − xi )(µ(ε) + xi ) dt X xj , − (xi + µ(ε))k
(3.1)
j6=i
which represents a shunting model with linear-response cells receiving cellnonspecific excitatory input µ(ε), positive feedback xi , and lateral inhibition P k j6=i xj . From equation 3.1, we can immediately see that each xi can fluctuate within a finite interval [−µ(ε), 1 + Wi ] if its initial states lie in the same interval. This implies that zi also fluctuates in a finite interval of positive values. An unusual feature of this shunting inhibition model is that the afferent input Wi appears in the cutoff factor for excitatory input, and accordingly the cutoff activity level is stimulus dependent. Also xi is usually interpreted as the membrane potential in the shunting models, whereas it may be interpreted as the firing rate in the present model. Cohen and Grossberg (1983) showed that nonlinear systems like equation 3.1 have global Lyapunov functions, which ensure absolute stability of dynamic evolution of the systems. In fact, by introducing zi = y2i , we can transform equation 2.8 into the following gradient dynamics in which energy function E is minimized with time: ∂E dyi =− , dt ∂yi
(3.2)
82
Tomoki Fukai and Shigeru Tanaka
with the energy function given by
E({yi }) = −
N X 1 + Wi i=1
4
y2i
à !2 N N N 1−k X k X εX 4 2 + yi + yi − log yi . (3.3) 8 i=1 8 i=1 2 i=1
This ensures the global stability of steady-state solutions to equation 2.8. The time evolution of the network state with k = 1 was reported to exhibit the winner-take-all (WTA) behavior in which activities of all cells except one receiving a maximum input are eliminated (Tanaka 1990). The other cases of 0 < k < 1 and k > 1 also show intriguing properties, which we call winners-share-all (WSA) and variant winner-take-all (VWTA), respectively. The WSA behavior admits more than one winner receiving inputs larger than some critical value determined by the distribution of strength of the inputs. The VWTA behavior admits only one winner, but it can be any cell receiving an input larger than some critical value. We first discuss the two cases of 0 < k < 1 and k > 1 since these cases are of primary concern in the present paper and can be dealt with in a similar manner. The case of k = 1 is also reviewed to complete the survey of the network dynamics. 3.1 WSA Case (0 < k < 1). We will derive an equilibrium solution, which represents the WSA behavior, to equation 2.8 when the strength of lateral inhibitory connections satisfies 0 < k < 1. We assume that ε and all Wi ’s are positive. It is also assumed that ε satisfies dε ¿ Wi for all i, with d being the number of winners in the steady state. We first examine solutions for ε = 0 and then analyze those for ε 6= 0 in a perturbative manner. We can explicitly study the stability of the solutions only for ε = 0. Therefore the results of the stability analysis hold only when ε is not too large. Studying this restricted case, however, should be sufficient for practical purposes since ε can be infinitesimally small in many applications of the present model. We can assume, without loss of generality, that external input Wi ’s obey the following inequality regarding their magnitudes: W1 ≥ W2 ≥ W3 ≥ · · · ≥ WN−1 ≥ WN ≥ 0.
(3.4)
A stable fixed-point solution {z(D) i } given by dzi /dt = 0 (i = 1 · · · N) for k 6= 1 and ε = 0 satisfies X j∈D
Kij zj(D) = 1 + Wi , z(D) = 0, i
i∈D (3.5) i∈ /D
A Simple Neural Network Exhibiting Selective Activation
83
where D = {1, 2, · · · , d} indicates a set of winners, and d × d matrix Kˆ = (Kij ) is given by 1 k Kˆ = . .. k
k 1 .. . ···
··· k k
k ··· .. . k
k k .. . . 1
(3.6)
When k 6= 1, Kˆ is not singular and has an inverse matrix, Kˆ −1 =
1 (k − 1)α
k−α k .. . k
k k−α .. . ···
··· k k
k ··· .. . k
k k .. . k−α
,
(3.7)
where α = dk + 1 − k > 0. Now defining ζi for i = 1, 2, . . . , N by Wi kd 1 + + hWiD , α 1 − k (k − 1)α 1X Wj , hWiD = d j∈D ζi =
(3.8) (3.9)
we obtain the stable fixed-point solution for k 6= 1 and ε = 0 as follows: = ζi δiD , z(D) i ½ 1 i∈D . δiD = 0 i∈ /D
(3.10)
The winners receiving larger inputs acquire larger values for nonvanishing z(D) i ’s as long as 0 < k < 1 is satisfied. Thus for this range of the magnitude of lateral inhibition, the network system exhibits the WSA behavior. Now we examine the stability conditions of solution 3.10. Substituting + δz(D) zi (t) = z(D) i i (t) into equation 3.8 and retaining the terms linear in the fluctuation δz(D) i (t), we obtain dδz(D) i dt
=
P −ζi (δz(D) +k δzj(D) ) + o(δz(D)2 ), i i j=1...N,j6=i
(1 −
k)ζi δz(D) i
+
o(δz(D)2 ). i
i∈D (3.11) i∈ /D
84
Tomoki Fukai and Shigeru Tanaka
Then the stability of the solution for d > 1 is determined by the eigenvalues of the following matrix:
−ζ1 −kζ2 .. . ˆ = −kζd M 0 . ..
−kζ1 −ζ2 ···
0
−kζ1 −kζ2 .. . −kζd ··· .. . ···
··· −kζ2 −ζd 0 .. . 0
··· ··· .. . −kζd (1 − k)ζd+1 .. . 0
··· ···
··· ···
−kζd 0 .. . ···
··· ··· ··· 0
−kζ1 −kζ2 .. . .(3.12) −kζd 0 .. . (1 − k)ζN
ˆ should be negative for the stability of the lowest-order All eigenvalues of M solution. This requires that ζi be positive for d winners and negative for (N − d) losers. Therefore the stability condition is consistent with the condition zD i > 0 (i ∈ D ) for the existence of the WSA solution. The condition can be expressed as 1 + Wmin >
kd (hWiD − Wmin ), 1−k
(3.13)
where Wmin is the smallest external input among Wi ’s for i ∈ D: Wmin = Wd by assumption 3.4. Since the right-hand side of equation 3.13 is an increasing function of d while the left-hand side is a decreasing function, there exists a critical upper bound dc above which condition 3.13 is not satisfied, or equivalently ζdc +1 , ζdc +2 , . . . , ζN < 0 (see Appendix A). This critical value gives the actual number of winners. Condition 3.13 indicates that the number of winners decreases as the relative strength k of the lateral inhibition approaches unity. The solution with no winner, d = 0, is forbidden even for k → 1− . From equations 3.8 and 3.12 with d = 0, it is found that all ζi = (1 − k)−1 (1 + Wi )’s must be negative in that case to ensure the stability of such a solution. This, however, never occurs for 0 < k < 1. Indeed, the network model exhibits the WTA behavior for k = 1, as shown later, and hence the dynamic behavior of the network continuously changes from WSA to WTA as k → 1− . From evolution equation 2.8 with ε = 0, we see that once some cells = 0, they can never become winners even when become losers with z(D) i the magnitudes of Wi ’s are changed to take a different order. Thus positive ε should be retained to ensure that the network undergoes transition in its activity pattern to adapt to changes in external inputs. The perturbative steady-state solution in the leading order of ε is easily obtained as = (ζi + εσi )δiD − ε z(D) i
1 (1 − δiD ) + O(ε 2 ), (1 − k)ζi
(3.14)
A Simple Neural Network Exhibiting Selective Activation
85
where σi =
X 1 (αζi−1 − k ζj−1 ). (1 − k)α j∈D
(3.15)
The stability condition at the zeroth order of ε (ζi > 0 for i ∈ D and ζi < 0 for ∈ / D) ensures that activities of losers are slightly raised from the zero level by positive ε. The averaged activity of the winners is also raised by P P (ε/d) j∈D σj = (ε/dα) j∈D ζj−1 . In Figure 1a, we show an example of the time course of the network state obtained for 0 < k < 1 by numerical simulations of the model network with 30 cells. Initial states at t = 0 were randomly selected in the interval [0, 1]. The inhibitory interactions reduce the net activity of the whole network system rapidly in the initial transient of the time evolution and then the surviving net activity is gradually distributed to winners according to the mechanism of the WSA. 3.2 VWTA Case (k > 1). For a stronger lateral inhibition of k > 1, the solution given by equation 3.14 is stable at the zeroth order of ε if the con/ D are obeyed. It is easy to see that ditions ζi < 0 for i ∈ D and ζi > 0 for i ∈ the number of winners cannot be more than one: It is not possible to satisfy the former condition for the winner zd receiving the smallest external input k P among the winners because αζd = 1 + Wd + k−1 ( j∈D Wj − dWd ) cannot be negative. Thus all the solutions with d > 1 are unstable. For d = 1, the stability analysis shows that all ζi ’s must be positive. Let cell a be the winner. Then ζa = 1 + Wa > 0 by the assumptions made previously, and furthermore, the condition k > 1 can make ζj = 1 + (kWa − Wj )/(k − 1) positive for ∀j 6= a even when Wa is not the largest among all inputs. Thus the network exhibits the WTA behavior in the sense that only a single cell can survive with a nonvanishing activity at equilibrium. This WTA behavior, however, is rather different from the one usually assumed: Any cell a other than the one receiving the largest input can be the winner, as long as it satisfies k(1 + Wa ) > (1 + Wj ),
(3.16)
for ∀j 6= a. In the time course of the network dynamics, a winner is selected from among those satisfying equation 3.16 according to an initial state given to the network at t = 0. Thus the results of selection in general depend on both external inputs and initial conditions. We call this type of WTA behavior the variant winner-take-all behavior to distinguish it from the conventional one in which the winner is always the cell receiving the largest input. In fact, behavior similar to VWTA has already been extensively discussed for a competitive shunting network (Grossberg and Levine 1975), and the socalled K-winner-take-all network (Majani et al. 1989; Wolfe et al. 1991), that
86
Tomoki Fukai and Shigeru Tanaka
Figure 1: The time courses of cell activities obtained by numerical simulations of the model network with N = 30 for (a) k = 0.85, (b) k = 1.06, and (c) k = 1. The distribution of {Wi } used in the simulations is shown in the form of a bar graph in (b). Several cells receiving the largest external inputs are numbered according to the magnitudes of the inputs.
A Simple Neural Network Exhibiting Selective Activation
87
Figure 2: The dynamical flow of the network state is shown for the network of two cells. W1 > W2 and k(1+W2 ) > 1+W1 . The filled circles represent attractors, while the empty one represents an unstable fixed point. A winner is selected according to initial states of time evolution under the operation of the VWTA mechanism.
is similar to the present one except that external inputs are often assumed to be neuron independent. Since our primary aim was to show that the special range (k < 1) of the ratio between the self-inhibition and uniform lateral inhibition achieved the neural activity selection in terms of the magnitudes of external inputs independent of initial conditions, we show only a few examples for the VWTA behavior without an extensive analysis of it. Figure 1b shows an example of the time evolution of zi (t)’s for the network exhibiting the VWTA behavior. The cell receiving the third-largest input defeated others and became a winner in this case. To obtain a better insight into the VWTA behavior, we show in Figure 2 the dynamical flow of the network state for N = 2. Both (1 + W1 , 0) and (0, 1 + W2 ) are attractors if k(1 + W2 ) > 1 + W1 is satisfied (i.e., the network is subjected to the VWTA behavior); otherwise, the former is the only attractor (i.e., it is subjected to the WTA behavior).
88
Tomoki Fukai and Shigeru Tanaka
3.3 WTA Case (k = 1). Now we discuss the behavior of the equation for k = 1, which is described by à ! N X dzi = zi 1 − zi0 + Wi + ε . (1 ≤ i ≤ N) . (3.17) dt i0 We can obtain fixed points by setting the time derivative dzi /dt equal to 0 in equation 3.17. Let us assume that Wi 6= Wi0 for any i and i0 such that i 6= i0 . By considering the first order in the perturbation expansion of ε, we obtain N fixed points, zE(n) (n = 1, . . . , N): X 1 1 δi,n z(n) = (1 + Wn )δi,n + ε − i 1 + Wn Wn − Wj j6=n +
ε(1 − δi,n ) + o(ε 2 ). Wn − Wi
(3.18)
Next, let us examine linear stability around a particular fixed point, zE(n) . (n) (n) When we let zi = z(n) i + δzi , we can derive a linearized equation for δzi from equation 2.1: P ε 2 δz(n) if i = n, −(1 + Wn ) δzj(n) − n + o(ε ), 1 + W n (n) j dδzi = (3.19) dt P (n) ε (n) 2 δzj + o(ε ), otherwise. −(Wn − Wi )δzi − W − W n i j Since the coefficient 1 + Wn is always positive, the fixed point under consideration is stable if Wn is the maximum of all Wi ’s. Thus we can say that neural activities play a survival game based on competitive exclusion, which finally leads to a winner-take-all state. Figure 1c shows an example of the time evolution of zi (t)’s for the network showing the WTA behavior. It is seen that the cell receiving the largest input becomes the only winner. When the magnitudes of the external stimuli are suddenly changed, the network state relaxes to a new equilibrium state in which some of the old winners are superseded by losers. The relaxation time for this process is an important quantity that determines the response time of the neural system. Due to the particularly simple structure of the equilibrium states in the WTA case, the relaxation time can be analytically estimated as ¶ µ 1W 0 1 (3.20) log τ∼ = 1W 0 2ε in a certain limited case for infinitesimal values of ε (see Appendix B). Here 1W 0 represents the difference between the largest and second-largest inputs in the new external stimulus environment.
A Simple Neural Network Exhibiting Selective Activation
89
3.4 Simulations of the Original Competitive Network. Transforming the network model (equation 2.1) into the Lotka-Volterra equation allows us to study the competitive behavior analytically. This transformation is justified when the ratio λ/β is negligibly small. To examine to what extent the present Lotka-Volterra system can approximate the original network model, simulations were conducted for the network model with the connections given by equation 2.7. Some facts are to be noted before the simulation results are presented. First, the distinction between winners and losers is not so manifest in the original model as in the Lotka-Volterra system (with ² = 0), since the values of output f (ui ) never reach zero for finite values of the membrane potential. In the present simulations, a neuron unit is regarded as a loser if its output in equilibrium is less than 10−3 . Second, in some cases, the winners for equation 2.8 exhibit output values beyond the range zi < f0 to which the behavior of the original model is limited. This is because we have omitted the factor ( f0 − zi )/f0 in equation 2.3 when deriving the Lotka-Volterra system. Therefore the qualitative behavior of equation 2.1 should be examined also in such a case. Figures 3a and 3b shows the numerically obtained boundaries on which different types of solutions switch to one another for the original model. The solid curves represent the boundaries between the WTA and VWTA solutions, while the dashed curves indicate those between the WSA and WTA solutions. The former curves were determined as the boundaries below which an initial configuration such as u2 (0) = W2 , u1 (0) = u3 (0) = · · · = 0 evolves into a final configuration with u1 (t) being the largest among all. The latter curves were obtained by simulations of equation 2.1 with initial configurations such as u1 (0) = u2 (0) = · · · = 0. Locations of the boundary curves depend on the criterion to distinguish winners from losers. In the corresponding Lotka-Volterra system, the critical value of k separating the WTA and WSA solutions is k ≈ 0.91, while that separating the WTA and VWTA solutions is k+ ≈ 1.1, for external inputs assumed in the simulations (see Section 4 for details of k± ). In Figure 3a f0 = 2, and all outputs of the Lotka-Volterra system are within the range 0 < z < f0 , while in Figure 3b f0 = 1 and some outputs are not within the range. Figure 3a shows that the behavior of the original network for different values of k can be appropriately described by the Lotka-Volterra system as long as λ is sufficiently small. Furthermore, the two neural networks exhibit qualitatively the same dynamic behavior in Figure 3b. It is noted that the solid and dashed curves intersect each other at a certain value of λ. Beyond this value, the WTA behavior occurs in an initial-value dependent manner in the region between the solid and dashed curves. In other words, an only winner in this region can be either a neuron with the largest input or one with the second-largest input depending on initial values. Such network behavior is classified as VWTA behavior by definition, and the VWTA and WSA behaviors switch at the boundaries given by the solid curves.
90
Tomoki Fukai and Shigeru Tanaka
Figure 3: The behavior obtained at equilibrium from simulation of the original model given in equation 2.1 Five neurons were used, and their external inputs were 0.9, 0.7, 0.5, 0.3, and 0.1. Two parameters β and γ were fixed at 5 and 0.4, respectively, while λ was varied. The asymptote of the response function was (a) f0 = 2 and (b) f0 = 1, respectively. In the latter case, 1 + Wi > f0 for any i.
A Simple Neural Network Exhibiting Selective Activation
91
Figure 4: A schematic diagram showing the switchings between different types of solutions to our Lotka-Volterra equation. The critical values of the ratio k of strength of lateral inhibition to that of self-inhibition are given in the text.
4 Discussion For convenience in mathematical analysis, we studied the solutions to our system separately for the three cases of 0 < k < 1, k = 1, and k > 1. For a given distribution of afferent inputs satisfying equation 3.4, the critical values of k at which switchings from one class of solutions to another occur can be obtained as follows: Setting d = 2 in equation 3.13 and taking only W1 and W2 into account, we can easily see that the switching between the WTA behavior and the WSA behavior with two winners takes place at k = k− ≡ (1 + W1 )/(1 + 2W1 − W2 ). Furthermore, setting d = N in equation 3.13, we find that PNall the neurons remain activated for k less than kL ≡ (1 + WN )/(1 + Wj − NWN ). Thus no neural selection occurs for 0 < k < kL . Note WN + j=1 that 0 < kL < k− < 1. Similarly, by setting a = 2 and j = 1 in equation 3.16, we can immediately see that the neuron receiving the second-largest input can be a winner when k is larger than k+ ≡ (1 + W1 )/(1 + W2 ), where k+ > 1. It is also noted that any neuron can be a winner if k > (1 + W1 )/(1 + WN ). Thus our competitive network exhibits the WSA behavior for kL < k < k− , the WTA behavior for k− < k < k+ , the VWTA behavior for k > k+ , and no selection for 0 < k < kL (see Fig. 4). Assuming different magnitudes for the self- and lateral inhibitions seems to be biologically reasonable. In the local feedback loop in biological neural networks, an interneuron may inhibit adjacent principal cells that activate the interneuron. The strength of this self-inhibition of the principal cells is likely to be different from that of lateral inhibition. Uniform strength was assumed for the lateral inhibition for the sake of rigorous analysis of the population dynamics of neurons. Under these assumptions, the model predicts that competitive neural networks tend to exhibit the WSA behavior when the self-inhibition is stronger than the lateral one. In particular, the model suggests a possible functional role of the self-inhibition: With it, the activity selection occurs in a stimulus-dependent manner, independent of initial conditions. Let us make comparisons of performances between the present network and networks with a conventional on-center off-surround type interactions,
92
Tomoki Fukai and Shigeru Tanaka
which has been fully discussed by Grossberg (1973) using shunting inhibition models. Regarding neural connectivity, our network differs from the conventional one only in the extent of the off-surround inhibition: all-to-all inhibitory connections are assumed in our model. Thus competition occurs not in the vicinity of edges where intensities of input signals change dramatically but among all input signals involved in a stimulus pattern. Based on our analysis presented here, for large values of k (k > k− ), only one neuron remains active, and the others become silent. This means that the network behaves to enhance the contrast between input signals in a WTA manner when the off-surround inhibition is strong. On the other hand, when the value of k is in the interval kL < k < k− , which represents not-so-strong offsurround inhibition, the network enhances the contrast in a WSA manner. For extremely small values of k (k < kL ), no tuning occurs, and an output activity pattern of the network is not very different from an input pattern. Biological significance of our network model may be seen in the neural selection performed by the superior colliculus (SC). In the SC, there is a map that represents the displacement vector of saccadic eye movement (Robinson 1972). A class of neurons in the intermediate layer of the SC exhibit burst firing just before the onset of saccadic eye movement (Wurtz and Munoz 1994). To generate a specific eye movement, the selection of neural activity that encodes a specific displacement vector needs to occur. It is expected that lateral inhibition within the SC may function for this neural selection. According to Wurtz et al. (1980), the lateral inhibition is so widely distributed in the SC that the inhibitory effect influences beyond 50 degrees in the visual space. This implies that inhibitory connections among neurons in the intermediate layer may be regarded as all-to-all connections, as we assumed in our model. Based on this idea of using long-range lateral inhibition, a model proposed by Van Opstal and Van Gisbergen (1989) successfully demonstrated the generation of saccadic eye movement. Recently, Optican (1994) proposed a network model of saccade generation in which burst cells in the intermediate layer of SC interact with one another through mutual inhibition. Although our mathematical model is much simpler than biological networks in the SC, the model is expected to provide an analytical understanding of the underlying mechanism for the activity selection that may actually occur in the SC. The extension of the competitive neural network from the WTA type to the WSA type also has the following theoretical and biological significance: Sensory information is likely to be encoded in the cortex by a large population of neurons. In the neuronal population coding, a single neuron may weakly respond to more than one specific sensory stimulus; that is, a stimulus can coactivate a large number of neurons. As a result, the populations of neurons responding to different stimuli will largely overlap with one another and the cortical representation of the stimulus space inevitably becomes rather complicated. The WSA mechanism in this study seems to provide a simple and appropriate theoretical basis for exploring the
A Simple Neural Network Exhibiting Selective Activation
93
information-theoretical implications of such a complicated neuronal population coding. It is of particular interest to investigate how and to what extent a self-organized cortical map of sensory information is changed if the WSA instead of the WTA behavior is employed in a layer of featuredetecting cells. The present model network will be useful for clarifying these questions because it yields a mathematically simple and tractable description of the feature-detecting layer. Furthermore, the neural selection mechanism based on the WTA for k = 1 has recently been employed in modeling the basal ganglia and motor cortices, which are involved in the generation of voluntary movement (Tamori and Tanaka 1993). It is worthwhile to examine how a new selection rule based on the WSA behavior enhances the model performance in encoding and decoding temporal information on voluntary movement. 5 Conclusion We derived a Lotka-Volterra equation of the mean firing rate from a standard membrane equation for a network composed of mutually inhibiting neurons. By employing the equation for neural selection, we proved that the ratio k of strength of lateral inhibition to that of self-inhibition is of paramount importance in categorizing the network behavior into three different types: WSA, WTA, and VWTA. The WSA behavior implies that the number of winners is in general larger than one and increases as k decreases. The VWTA behavior gives one winner, but it is not necessarily the cell that receives the maximum input. Appendix A: The Number of Winners for Equidistantly Distributed {Wi } We show a simple case where the number of winners (dc ) can be explicitly determined from condition 3.13. Assume that the external inputs are distributed equidistantly according to Wi = Wi−1 + 1 with a constant 1 for any i. Then a short manipulation gives the following expression of dc : 3 1 dc = − + 2 k
s
µ
3 1 − 2 k
¶2
µ
¶µ
1 −1 +2 k
1 + W1 1+ 1
¶
,
(A.1)
where [· · ·] stands for the gaussian notation. We can see that dc approaches unity from above as k → 1. Appendix B: Nonlinear Relaxation Time for Synaptic Modification We are interested in how the neural activities relax from an old stable state to a new stable state when the configuration of the afferent inputs {Wi } slightly
94
Tomoki Fukai and Shigeru Tanaka
changes to {Wi0 } at t = 0. Without loss of generality, we can assume that W1 is the maximum and W2 is the second maximum of all Wi ’s for t < 0, and that W20 is the maximum and W10 is the second maximum of all Wi0 ’s for t > 0. First, we discuss the linear stability around the old stable fixed point zE(1) for t > 0. The corresponding linearized equation is P (W10 − W1 )(1 + W1 ) − (1 + W1 ) j δzj(1) (1) dzi (1) 0 for i = 1 (B.1) = + (W1 − W1 )δz1 + o(ε), dt 0 (1) otherwise. (Wi − W1 )δzi + o(ε), Comparing the order of magnitude of the first term for i = 1 in equation B.1 with that of other terms, we find that this term determines the dynamical behavior at this stage. Thus, we find that the system moves linearly with 0 respect to time from zE(1) toward the following new fixed point zE(1) which is not stable: X 1 1 0 δi,1 = (1 + W10 )δi,1 + ε − z(1) i 1 + W10 W10 − Wj0 j6=1 +
W10
ε (1 − δi,1 ) + o(ε 2 ) . − Wi0
(B.2)
This relaxation requires time τ1 , which can be roughly estimated by , τ1 =
0 (z(1) 1
−
z(1) 1 )
1 dδz(1) i = + o(ε). dt 1 + W1
(B.3)
This time scale implies that the system relaxes very rapidly to the nearest new fixed point, which is not stable. 0 0 Next the system moves from zE(1) to a new stable fixed point zE(2) . In this process, we may expect that the only relevant variables are z1 (t) and z2 (t). Hence, we can neglect time evolution for other zi (t)’s. Consequently, we can reduce equation B.1 to the following equations: dz1 = z1 (1 − z1 − z2 − r + W10 ) + ε, dt
(B.4a)
dz2 = z2 (1 − z1 − z2 − r + W20 ) + ε, (B.4b) dt P zi = o(Nε). Using new variables ξ = z1 + z2 and η = z2 , we where r = i6=1,2
obtain dξ = ξ(1 − r + W10 − ξ ) + 2ε + 1W 0 η, dt
(B.5a)
A Simple Neural Network Exhibiting Selective Activation
dη = η(1 − r + W20 − ξ ) + ε, dt
95
(B.5b)
where 1W 0 = W20 − W10 . Here, the problem is how this system changes from 0 0 one state: ξ ≈ o(1) and η = z(1) 2 ≈ ε/1W for t = 0 to the other state: ξ ≈ o(1) 0 ≈ o(1) for t > 0. We regard the last two terms 2ε + 1W 0 η in and η = z(2) 2 equation B.5a as a small perturbation. Since ξ rapidly follows the change in η, ξ can be approximated by 1W 0 η + 2ε . ξ∼ = 1 − r + W10 + 1 − r + W10
(B.6)
If we substitute ξ of equation B.5b for the right-hand side of B.6, we obtain dη = aη(b − η) + o(ε), dt
(B.7)
where a = 1W 0 /(1 − r + W10 ) and b = 1 − r + W10 . This differential equation has the asymptotic solution η(t) = b
c(t) . 1 + c(t)
(B.8)
c(t) satisfies the equation dc(t)/dt = ac(t)+o(ε). The solution to this equation is c(t) = (c0 + ε/a)eat − o(ε), where c0 is the initial value of c(t) and is given by c0 ∼ = ε/1W 0 . Let us define the relaxation time τ2 at this stage (Suzuki 1978) by c(τ2 ) = 1. η becomes half of the maximum b/2 from the initial infinitesimal value after time τ2 . Therefore, the relaxation time is given by τ2 ∼ =
¶ µ 1 1W 0 . log 1W 0 2ε
(B.9)
From the relaxation times τ1 and τ2 obtained above, it is evident that τ2 is much larger than τ1 for infinitesimal values of ε. Consequently, the total relaxation time τ can be given by τ2 . References Alexander, G. E., DeLong, M. R., and Strick, P. L. 1986. Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Ann. Rev. Neurosci. 9, 357–381. Cohen, M., and Grossberg, S. 1983. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems, Man and Cybernetics 13, 815–826. Coultrip, R., Granger, R., and Lynch, G. 1992. A cortical model of winner-take-all competition via lateral inhibition. Neural Networks 5, 47–54.
96
Tomoki Fukai and Shigeru Tanaka
Cowan, J. D. 1968. Statistical mechanics of nervous nets. In Neural Networks, E. R. Caianiello, ed., pp. 181–188. Springer-Verlag, Berlin. Cowan J. D. 1970. A statistical mechanics of nervous activity. In Some Mathematical Questions in Biology, AMS Lectures on Mathematics in the Life Sciences 2, pp. 1–57. American Mathematical Society, Providence, RI. Ermentrout, B. 1992. Complex dynamics in winner-take-all neural nets with slow inhibition. Neural Networks 5, 415–431. Georgopoulos, A. P., Taira, M., and Lukashin, A. 1993. Cognitive neurophysiology of the motor cortex. Science 260, 47–52. Gilbert C. D., and Wiesel, T. N. 1990. The influence of contextual stimuli on the orientation selectivity of cells in primary visual cortex of the cat. Vision Res. 30, 1689–1701. Grossberg, S. 1973. Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Appl. Math. 52, 213–257. Grossberg, S., and Levine, D. 1975. Some developmental and attentional biases in the contrast enhancement and short term memory in recurrent neural networks. Studies in Appl. Math. B53, 341–380. Hikosaka, O. 1991. Basal ganglia—possible role in motor coordination and learning. Current Opinion in Neurobiology 1, 638–643. Kaski, S., and Kohonen, T. 1994. Winner-take-all networks for physiological models of competitive learning. Neural Networks 7, 973–984. Majani, E., Erlarson, R., and Abu-Mostafa, Y. 1989. On the K-winners-take-all network. In Advances in Neural Information Processing Systems I, D. S. Touretzky, ed., pp. 634–642. Morgan Kaufmann, Los Altos, CA. Optican, L. M. 1994. Control of saccade trajectory by the superior colliculus. In Contemporary ocular motor and vestibular research: A tribute to David A. Robinson, A. F. Fuch, T. Brandt, U. Buttner, and D. S. Zee, eds., pp. 98–105. SpringerVerlag, Berlin. Robinson, D. A. 1972. Eye movements evoked by collicular electrical stimulation in the alert monkey. Vision Research 12, 1285–1302. Shepherd, G. M. 1990. The Synaptic Organization of the Brain. 3d ed. Oxford University Press, New York. Suzuki, M. 1978. Theory of instability, nonlinear Brownian motion and formation of macroscopic order. Phys. Lett. 67A, 339–341. Tamori, Y., and Tanaka, S. 1993. A model for functional relationships between cerebral cortex and basal ganglia in voluntary movement. Society for Neuroscience Abstracts 19, 547. Tanaka, S. 1990. Theory of self-organization of cortical maps: Mathematical framework. Neural Networks 3, 625–640. Van Essen, D. C., Newsome, W. T., and Maunsel, J. H. R. 1984. The visual field representation in striate cortex of the macaque monkey: Asymmetries, anisotropies and individual variability. Vision Res. 24, 429–448. Van Opstal, A. J., and Van Gisbergen J. A. M. 1989. A nonlinear model for collicular spatial interactions underlying the metrical properties of electrically elicited saccades. Biol. Cybern. 60, 171–183. Wilson, M. A., and McNaughton, B. 1993. Dynamics of the hippocampal ensemble code for space. Science 261, 1055–1058.
A Simple Neural Network Exhibiting Selective Activation
97
Wolfe, W. J., Mathis, D., Anderson, C., Rothman, J., Gotter, M., Brady, G., Walker, R., Duane, G., and Alaghband, G. 1991. K-winner networks. IEEE Trans. Neural Networks 2, 310–315. Wurtz, R. H., and Munoz, D. P. 1994. Organization of saccade related neurons in monkey superior colliculus. In Contemporary Ocular Motor and Vestibular Research: A Tribute to David A. Robinson, A. F. Fuch, T. Brandt, U. Buttner, and D. S. Zee, eds., pp. 520–527. Springer-Verlag, Berlin. Wurtz, R. H., Richmond, B. J., and Judge, S. J. 1980. Vision during saccadic eye movements III: Visual interactions in monkey superior colliculus. J. Neurophysiol. 43, 1168–1181. Young, M. P., and Yamane, S. 1992. Sparse population coding of faces in the inferotemporal cortex. Science 256, 1327–1331. Yuille, A. L., and Grzywacz N. M. 1989. A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Comp. 1, 334–347.
Received April 25, 1995; accepted April 11, 1996.
Communicated by Eric Baum
Playing Billiards in Version Space P´al Ruj´an Fachbereich 8 Physik and ICBM, Postfach 2503, Carl-von-Ossietzky Universit¨at, 26111 Oldenburg, Germany
A ray-tracing method inspired by ergodic billiards is used to estimate the theoretically best decision rule for a given set of linear separable examples. For randomly distributed examples, the billiard estimate of the single Perceptron with best average generalization probability agrees with known analytic results, while for real-life classification problems, the generalization probability is consistently enhanced when compared to the maximal stability Perceptron. 1 Introduction Neural networks can be used for both concept learning (classification) and for function interpolation and/or extrapolation. Two basic mathematical methods seem to be particularly adequate for studying neural networks: geometry (especially combinatorial geometry) and probability theory (statistical physics). Geometry is illuminating, and probability theory is powerful. In this article I consider what is perhaps the simplest neural network, the venerable Perceptron (Rosenblatt 1958): Given a set of examples falling in two classes, and find the linear discriminant surface separating the two sets, if one exists. In this context, my goal is twofold: (1) give the optimal (or Bayesian) decision theory a geometric interpretation and (2) use that geometric content for designing a deceptively simple method for computing the single Perceptron with the best possible generalization probability. As shown in Figure 1, a Perceptron is a network consisting of N (binary or real) inputs xi and a single binary output neuron. The output neuron sums all inputs with the corresponding synaptic weights wi , then performs a threshold operation according to à σ = sign
N X
! xi wi − θ
E − θ] = ±1. = sign[(E x, w)
(1.1)
i=1
In general, the synaptic weights and the threshold can be arbitrary real E has only binary components is called a binary numbers. A network where w Perceptron. The output unit has binary values σ = ±1 and labels the class to E = (w1 , w2 , . . . , wN ) which the input vector belongs. Both the weight vector w Neural Computation 9, 99–122 (1997)
c 1997 Massachusetts Institute of Technology °
100
P´al Ruj´an
Figure 1: The Perceptron as neural network.
and the threshold θ should be learned from a set of M examples whose class is known (the training set). The Perceptron and its variants (Hertz et al. 1991) are among the few networks for which basic properties like the maximal information capacity (how many random1 examples can be faultlessly stored in the network) or the classification error of certain learning algorithms (how many examples are needed in order to achieve a given generalization probability) can be obtained analytically. Such calculations are done by first defining a learning protocol. For example, one assumes that the examples are independently and identically sampled from some stationary probability distribution; that one has access to a very large set of such examples for both training and testing purposes; that no matter how many examples one generates; that there is always a single Perceptron network solving the problem without errors; and so forth. Almost all statistical mechanical calculations also require the thermodynamic limit N → ∞, M → ∞, α = M/N = const. These assumptions are needed for technological reasons—some integrals 1 By random vectors we mean in the following vectors sampled independently and identically from a constant distribution defined on the surface of the unit sphere (E x, xE) = 1.
Playing Billiards in Version Space
101
must be concretely evaluated—but also because otherwise the problem is mathematically not well defined. In most practical situations, however, one is confronted with situations that do not satisfy some of these assumptions: One has a relatively small set of examples; their distribution is unknown and often not stationary; the examples are not independently sampled; and so forth. In a strict sense, such problems are mathematically not quantifiable. However, even a small set of examples contains some information about the nature of the problem (Polya ´ 1968). The basic question is then, among all possible networks trained on this set, which one has possibly the best generalization probability? As explained below, understanding the geometry of the version space leads to the correct answer and to algorithms for computing it. Our protocol assumes that both the training and test examples are drawn independently and identically from the distribution P(ξE ). The simplest way to generate such a rule is to define a teacher network with the same structure as shown in Figure 1, providing the correct class label σ = ±1, for M , ξE (ν) ∈ RN and their each input vector. Given a set of M examples {ξE (ν) }ν=1 corresponding binary class label σν = ±1, the task of the learning algorithm is to find a student network that mimics the teacher by choosing the network parameters that correctly classify the training examples. Equation 1.1 E xE) = θ implies that the learning task consists of finding a hyperplane (w, M : separating the positive from the negative labeled examples xE ∈ {ξE (ν) }ν=1 E ξE (α) ) ≥ θ1 , σα = 1 (w, E ξE (β) ) ≤ θ2 , σβ = −1 (w,
(1.2)
θ1 ≥ θ ≥ θ2 . The typical situation is that either none or infinitely many such solutions E θ) is called the version exist. The vector space whose points are the vectors (w, space and its subspace satisfying equation 1.2 the solution polyhedron. Most often one wishes to minimize the average probability of class error for a new xE vector drawn from P(ξE ). In other situations another network, that in average makes more errors but avoids very “bad” ones, is preferable. Yet another problem occurs when the data provided by the teacher are noisy (Opper and Haussler 1992). In what follows we shall restrict ourselves to the standard problem of minimal average classification error. In theory, one knows that for a given training set, the optimal Bayes E θ} soludecision (Opper and Haussler 1991) implies an average over all {w, tions satisfying equation 1.2. Since each solution is perfectly consistent with the training set, in the absence of any other a priori knowledge, one must consider them as equally probable. This is not the case, for instance, in the presence of noise. The best strategy is then to associate with each point in the version space a Boltzmann weight whose energy is the squared error function and whose temperature depends on the noise strength (Opper and Haussler 1992).
102
P´al Ruj´an
For examples independently and identically drawn from a constant distribution (P(ξE ) = const) Watkin (1993) has shown that in the thermodynamic limit the single Perceptron corresponding to the center of mass of the solution polyhedron has the same generalization probability as the Bayes decision. On the practical side, known learning algorithms like Adaline (Widrow and Hoff 1960; Diedrich and Opper 1987) or the maximal stability Perceptron (MSP) (Vapnik 1982; Anlauf and Biehl 1989; Ruj´an 1993) are not Bayes optimal. In fact, if all input vectors have the same length, then the maximal stability Perceptron network corresponds to the center of the largest spherical cone inscribed into the solution polyhedron, as shown in Section 4. There have been several attempts at developing learning algorithms approximating the Bayes decision. Watkin (1993) used a randomized AdaTron algorithm to sample the version space. More recently, Bouten et al. (1995) defined a class of convex functions whose minimum lies within the solution polyhedron. By changing the function’s parameters, one can obtain a minimum, which is a good approximation of the known Bayes solution. This idea is somewhat similar to the notion of an analytic center of a convex polytope introduced by Sonnevend (1988) and used extensively for designing fast linear programming algorithms (Vaidya 1990). A very promising but not yet fully exploited approach (Opper 1993) uses the Thouless-Anderson-Palmer (TAP) equations originally developed for the spin-glass problem. In this paper I propose a quite different method, based on an analogy to classical, ergodic billiards. Considering the solution polyhedron as a dynamic system, a long trajectory is generated and used to estimate the center of mass of the billiard. The same idea can be applied also to other optimization problems. Questions related to the theory of billiards are briefly considered in Section 2. Section 3 sets the stage by presenting an elementary geometric analysis in two dimensions. The more elaborated geometry of the Perceptron version space is discussed in Section 4. An elementary algorithmic implementation for open polyhedral cones and their projective geometric closures are summarized in Section 5. Numerical results and a comparison to known analytic bounds and to other learning algorithms can be found in Section 6. Conclusions and further prospects are summarized in Section 7. 2 Billiards A billiard is usually defined as a closed space region (compact set) P ∈ RN dimensions. The boundaries of a billiard are usually piecewise smooth functions. Within these boundaries, a point mass (ball) moves freely, except for the elastic collisions with the enclosing walls. Hence, the absolute value of the momentum is preserved, and the phase space, B = P × S N−1 , is the direct product of P and S N−1 , the surface of the N-dimensional unit velocity sphere. Such a simple Hamiltonian dynamics defines a flow and its
Playing Billiards in Version Space
103
Poincar´e map an automorphism. The mathematicians have defined a finely tuned hierarchy of notions related to such dynamic systems. For instance, simple ergodicity as implied by the Birkhoff-Hincsin theorem means that the average of any integrable function defined on the phase space over a single but very long trajectory equals the spatial mean (except for a set of zero measure). Furthermore, integrable functions invariant under the dynamics must be constant. From a practical point of view, this means that almost all very long trajectories will cover the phase space uniformly. Properties like mixing (Kolmogorov mixing) are stronger than ergodicity. They require the flow to mix uniformly different subsets of B . In hyperbolic systems one can go even further and construct Markov partitions defined on symbolic dynamics and eventually prove related central limit theorems. Not all convex billiards are ergodic. Notable exceptions are ellipsoidal billiards, which can be solved by a proper separation of variables (Moser 1980). Already Jacobi knew that a trajectory started close to and along the boundaries of an ellipse cannot reach a central region bounded by the socalled caustics. In addition to billiards that can be solved by a separation of variables, there are a few other exactly soluble polyhedral billiards (Kuttler and Sigillito 1984). Such solutions are intimately related to the reflection method— the billiard tiles the entire space perfectly. A notable example is the equilateral triangle billiard, first solved by Lam´e in 1852. Other examples of such integrable billiards can be obtained by mapping an exactly soluble onedimensional, many-particle system into a one-particle, high-dimensional billiard (Krishnamurthy et al. 1982). Apart from these exceptions, small perturbations in the form of the billiard usually destroy integrability and lead to chaotic behavior. For example, the stadium billiard (two half-circles joined by two parallel lines) is ergodic in a strong sense; the metric entropy is nonvanishing (Bunimovich 1979). The dynamics induced by the billiard is hyperbolic if at any point in phase space there are both expanding (unstable) and shrinking (stable) manifolds. A famous example is Sinai’s reformulation of the Lorentz gas problem. Deep mathematical methods were needed to prove the Kolmogorov mixing property and in constructing the Markov partitions for the symbolic dynamics of such systems (Bunimovich and Sinai 1980). The question whether a particular billiard is ergodic can be decided in principle by solving the Schrodinger ¨ problem for a free particle trapped in the billiard box. If the eigenfunctions corresponding to the high-energy modes are roughly constant, then the billiard is ergodic. Only a few general results are known for such quantum problems. In fact, I am not aware of theoretical results concerning the ergodic properties of convex polyhedral billiards in high dimensions. If all angles of the polyhedra are rational, then the billiard is weakly ergodic in the sense that the velocity direction will reach only rational angles (relative to the initial direction). In general, as long as two neighboring trajectories collide with the same polyhedral faces,
104
P´al Ruj´an
their distance will grow only linearly. Once they are far enough to collide with different faces of the polyhedron, their distance will abruptly increase. Hence, except for very special cases with high symmetry, it seems unlikely that high-dimensional convex polyhedra as generated by the training examples will fail to be ergodic.

3 A Simple Geometric Problem

For the sake of simplicity, let us illustrate the approach in a two-dimensional setting. In the next section we will show how the concepts developed here generalize to the more challenging Perceptron problem. Let P be a closed convex polygon and $\vec v$ a given unit vector, defining a particular direction in $R^2$. The direction of $\vec v$ can also be described by the angle φ it makes with the x-axis. Next, construct the line perpendicular to $\vec v$ that halves the area of the polygon P, $A_1 = A_2$ (see Fig. 2a). In the next section we will show that this geometric construction is analogous to the Bayes decision in version space. Choosing a set of vectors $\vec v$ oriented at different angles leads to the set of "Bayes lines" seen in Figure 2b. It is obvious that these lines do not intersect at one single point. The same task is computationally not feasible in a high-dimensional version space. It is certainly more economical to compute and store a single point $\vec r_0$ that optimally represents the full information contained in Figure 2b. As evident from Figure 2b, the majority of lines passing through $\vec r_0$ will not partition P into equal areas but will make some mistakes, denoted by $\Delta A = A_1 - A_2 \neq 0$. $\Delta A$ depends on both the direction of $\vec v$ and the polygon P, $\Delta A = \Delta A(\vec v, P)$. Different optimality criteria can be formulated depending on the actual application. The usual approach corresponds to minimizing the squared area-difference averaged over all possible $\vec v$ directions:

$$\vec r_0 = \arg\min \langle (\Delta A)^2 \rangle = \arg\min \int (\Delta A)^2(\vec v)\, p(\vec v)\, d\vec v \qquad (3.1)$$
where $p(\vec v)$ is some a priori known distribution of vectors $\vec v$. Another possible criterion optimizes the worst-case loss over all directions:

$$\vec r_1 = \arg\inf \sup \left\{ (\Delta A)^2 \right\}. \qquad (3.2)$$
In what follows, the point $\vec r_0$ is called the Bayes point. The calculation of the Bayes point according to equation 3.1 is computationally feasible, but a lot of computer power is still needed.

Figure 2: (a) Halving the area along a given direction. (b) The resulting Bayes lines.

A good estimate of the Bayes point is given by the center of mass of the polygon P:

$$\vec S = \frac{\int_P \vec r\,\rho(\vec r)\,dA}{\int_P \rho(\vec r)\,dA} \qquad (3.3)$$

where $\rho(\vec r) = \mathrm{const}$ is the surface mass density.
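As a concrete illustration, here is a minimal numerical sketch of equation 3.3 for a polygon given by its vertex list, using the standard shoelace formulas for constant ρ. The vertex coordinates are those of the polygon in Table 1 below; the function name and layout are assumptions made for this example, not part of the original algorithm.

    import numpy as np

    def polygon_center_of_mass(vertices):
        """Center of mass of a simple polygon with constant surface density
        (equation 3.3 with rho = const), via the shoelace formulas.
        Works for either vertex orientation: the area sign cancels."""
        v = np.asarray(vertices, dtype=float)
        x, y = v[:, 0], v[:, 1]
        xn, yn = np.roll(x, -1), np.roll(y, -1)
        cross = x * yn - xn * y              # signed parallelogram areas
        area = cross.sum() / 2.0             # signed polygon area
        cx = ((x + xn) * cross).sum() / (6.0 * area)
        cy = ((y + yn) * cross).sum() / (6.0 * area)
        return np.array([cx, cy])

    # The polygon of Table 1:
    P = [(1, 0), (4, 6), (9, 4), (11, 0), (9, -2)]
    print(polygon_center_of_mass(P))         # approximately (6.1250, 1.6667)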
Table 1: Exact Coordinates (x, y) of the Bayes Point for the Polygon P = (1, 0), (4, 6), (9, 4), (11, 0), (9, −2) and Its Various Estimates: Center of Mass, Maximal Inscribed Circle, Trajectories of Different Lengths.

Method                        x        σx      y        σy      ⟨ΔA²⟩
Bayes point                   6.1048   —       1.7376   —       1.4425
Center of mass                6.1250   —       1.6667   —       1.5290
Largest inscribed circle      5.4960   —       2.0672   —       7.3551
Billiard—10 collisions        6.0012   0.701   1.6720   0.490   7.0265
Billiard—10² collisions       6.1077   0.250   1.6640   0.095   2.6207
Billiard—10³ collisions       6.1096   0.089   1.6686   0.027   1.6774
Billiard—10⁴ collisions       6.1232   0.028   1.6670   0.011   1.5459
Billiard—10⁵ collisions       6.1239   0.010   1.6663   0.004   1.5335
Billiard—10⁶ collisions       6.1247   0.003   1.6667   0.003   1.5295
Table 1 presents exact numerical results for the x and y coordinates of both the Bayes point and the center of mass. The center of mass is an excellent approximation of the Bayes point. In very high dimensions, as shown by T. Watkin (1993), $\vec r_0 \to \vec S$. A polyhedron can be described by either the list of its vertices or the set of vectors normal to its facets. The transition from one representation to the other requires exponentially many arithmetic operations as a function of dimension. Therefore, in typical classification tasks, this transformation is not practicable. For "round" polygons (polyhedra), the center of the smallest circumscribed or the largest inscribed circle (sphere) is a good choice for approximating the Bayes point in the vertex and the facet representation, respectively. Since in our case the polyhedron generalizing P is determined by a set of normal vectors, only the largest inscribed circle (sphere) is a feasible approximation (see Fig. 3). The numerical values of the (x_R, y_R) coordinates of its center point are displayed in Table 1. A better approximation would be to compute the center of the largest-volume inscribed ellipsoid (Ruján, in preparation), a problem also common in nonlinear optimization (Khachiyan and Todd 1993). The best-known algorithms require $O(M^{3.5})$ operations, where M is in our case the number of examples. Additional logarithmic factors have been neglected (for details, see Khachiyan and Todd 1993). The purpose of this paper is to show that a reasonable estimate of $\vec r_0$ can be obtained faster by following the (ergodic) trajectory of an elastic ball inside the polygon, as shown in Figure 4a for four collisions. Figure 4b shows the trajectory of Figure 4a from another perspective, obtained by performing an appropriate reflection on each collision edge. A trajectory is periodic if
Figure 3: The largest inscribed circle. The center of mass (cross) is plotted for comparison.
after a finite number of such reflections the polygon P is mapped onto itself. Fully integrable systems correspond in this respect to polygons that tile the whole space without holes (that this geometric point of view applies also to other fully integrable systems is nicely exposed in Sutherland 1985). If the dynamics is ergodic in the sense discussed in Section 2, then a long enough trajectory should cover the surface of the polygon without holes. By computing the total center of mass of the trajectory, one should then obtain a good estimate of the polygon's center of mass. The question of whether a billiard formed by a "generic" convex polytope is ergodic is, to my knowledge, not solved. Extensive numerical calculations are possible only in low dimensions. The extent to which the trajectory covers P after 1000 collisions is visualized in Figure 5. By continuing this procedure, one can convince oneself that all holes are filled up, so that the trajectory will visit every point inside P. The next question is whether the area of P is homogeneously covered by the trajectory. The numerical results summarized in Table 1 were obtained by averaging over 100 different trajectories for each given length. As the length of the trajectory is increased, these averages converge to the center of mass. Also displayed are the empirical standard deviations σ_{x,y} of the trajectory center-of-mass coordinates and the average squared area-difference, ⟨ΔA²⟩.
Figure 4: (a) A trajectory with four collisions. (b) Its “straightened” form.
Figure 5: A trajectory with 1000 collisions. To improve the resolution, a dotted line has been used.
4 The Geometry of Perceptrons

Consider a set of M training examples, consisting of N-dimensional vectors $\vec\xi^{(\nu)}$ and their class $\sigma_\nu$, ν = 1, . . . , M. Now let us introduce the (N + 1)-dimensional vectors $\vec\zeta^{(\nu)} = (\sigma_\nu \vec\xi^{(\nu)}, -\sigma_\nu)$ and $\vec W = (\vec w, \theta)$. In this representation equation 1.2 becomes equivalent to a standard set of linear inequalities

$$(\vec W, \vec\zeta^{(\nu)}) \geq \Delta > 0. \qquad (4.1)$$
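To make the representation concrete, here is a minimal Python sketch (hypothetical names) of how the example vectors of equation 4.1 can be built and tested; the minimum over ν of the left-hand sides is the stability Δ defined just below.

    import numpy as np

    def build_zeta(xi, sigma):
        """The (N+1)-dimensional example vectors of equation 4.1:
        zeta^(nu) = (sigma_nu * xi^(nu), -sigma_nu)."""
        xi = np.asarray(xi, dtype=float)         # shape (M, N)
        sigma = np.asarray(sigma, dtype=float)   # shape (M,), entries +1 or -1
        return np.hstack([sigma[:, None] * xi, -sigma[:, None]])

    def margins(W, zeta):
        """Left-hand sides (W, zeta^(nu)) of equation 4.1. Their minimum is
        the stability Delta; W = (w, theta) solves the training set iff
        all of them are positive."""
        return zeta @ W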
The parameter $\Delta = \min_{\{\nu\}} (\vec W, \vec\zeta^{(\nu)})$ is called the stability of the linear inequality system. A bigger Δ implies a solution that is more robust against small changes in the example vectors. The vector space whose points are the examples $\vec\zeta^{(\nu)}$ is the example space. In this space $\vec W$ corresponds to the normal vector of an (N + 1)-dimensional hyperplane. The version space is the space of the vectors $\vec W$. Hence, a given example vector $\vec\zeta$ corresponds here to the normal of a hyperplane. The inequalities (equation 4.1) define a convex polyhedral cone whose boundary hyperplanes are determined by the training set $\{\vec\zeta^{(\nu)}\}_{\nu=1}^{M}$. How can one use the information contained in the example set for making the best average prediction on the class label of a new vector drawn from
the same distribution? Each newly presented example $\vec\zeta^{(\mathrm{new})}$ corresponds to a hyperplane in version space. The direction of its normal is defined up to a factor σ = ±1, the corresponding class. The best possible decision for classifying the new example follows the Bayes scheme: For each new (test) example, generate the corresponding hyperplane in version space. If this hyperplane does not intersect the solution polyhedron, consider the normal to be positive when pointing to the part of the version space containing the solution polyhedron. Hence, all Perceptrons satisfying equation 4.1 will classify the new example unanimously. If the hyperplane cuts the solution polyhedron in two parts, point the normal toward the bigger half. Therefore, the decision that minimizes the average generalization error is given by weighing the average measure of pro votes against the average measure of contra votes of all Perceptrons making no errors on the training set. We see that the Bayes decision is analogous to the geometric problem described in the previous section. However, the solution polyhedral cone is either open or is defined on the unit (N + 1)-dimensional hypersphere (see also Fig. 6b for an illustration). The practical question is how to find simple approximations for the Bayes point. One possibility is to consider the Perceptron with maximal stability, defined as follows:

$$\vec W_{\mathrm{MSP}} = \arg\max_{\vec W} \Delta; \quad (\vec W, \vec W) = 1 \qquad (4.2)$$
or, equivalently, as

$$\vec w_{\mathrm{MSP}} = \arg\max_{\vec w} \{\theta_1 - \theta_2\}; \quad (\vec w, \vec w) = 1 \qquad (4.3)$$
where $\theta = (\theta_1 + \theta_2)/2$ and $\Delta = (\theta_1 - \theta_2)/2$. The quadratic conditions $(\vec W, \vec W) = 1$ [$(\vec w, \vec w) = 1$] are necessary because otherwise one could multiply $\vec w$ and $\theta_{1,2}$ by a large number, making $\theta_1 - \theta_2$ and thus Δ arbitrarily large. Equation 4.3 has a very simple geometric interpretation, shown in Figure 6. Consider the convex hulls of the vectors $\vec\xi^{(\alpha)}$ belonging to the positive examples, $\sigma_\alpha = 1$, and of those in the negative class, $\vec\xi^{(\beta)}$, $\sigma_\beta = -1$, respectively. According to equation 4.3, the Perceptron with maximal stability corresponds to the slab of maximal width one can put between the two convex hulls (the "maximal dead zone" [Lampert 1969]). Geometrically, this problem is equivalent (dual) to finding the direction of the shortest line segment connecting the two convex hulls (the minimal connector problem). Since the dual problem minimizes a quadratic function subject to linear constraints, it is a quadratic programming problem. By choosing $\theta = (\theta_1 + \theta_2)/2$ (dotted line in Fig. 6) one obtains the MSP. The direction of $\vec w$ is determined by at most N + 1 vertices taken from both convex hulls, called active constraints.
Figure 6a shows a simple two-dimensional example; the active constraints are labeled A, B, and C, respectively. The version space is three-dimensional, as shown in Figure 6b. The three planes represent the constraints imposed by the examples A (left plane), B (right plane), and C (lower plane). The bar points from the origin to the point defined by the MSP solution. The sphere corresponds to the normalization constraint. If the example vectors $\vec\xi^{(\nu)}$ all have the same length N, then equations 4.1 and 4.2 imply that the distances between $\vec W_{\mathrm{MSP}}$ and the hyperplanes corresponding to active constraints are all equal to $\Delta_{\max}/N$. All other hyperplanes participating in the polyhedral cone are farther away. Accordingly, the MSP corresponds to the center of the largest circle inscribed into the spherical triangle defined by the intersection of the unit sphere with the solution polyhedron, the point where the bar intersects the sphere in Figure 6b. A fast algorithm for computing the minimal connector, requiring on average O(N²M) operations and O(N²) storage space, can be found in Ruján (1993).

5 How to Play Billiards in Version Space

Each billiard game starts by first placing the ball(s) on the pool table. This is not always a trivial task. In our case, the MSP algorithm (Ruján 1993) does it or signals that a solution does not exist. The trajectory is initiated by generating at random a unit direction vector $\vec v$ in version space. The basic step consists of finding out where, on which hyperplane, the next collision will take place. The idea is to compute how much time the ball needs until it eventually hits each one of the M hyperplanes. Given a point $\vec W = (\vec w, \theta)$ in version space and a unit direction vector $\vec v$, let us denote the distance along the hyperplane normal $\vec\zeta$ by $d_n$ and the component of $\vec v$ perpendicular to the hyperplane by $v_n$. In this notation the flight time needed to reach this plane is given by
$$d_n = (\vec W, \vec\zeta), \qquad v_n = (\vec v, \vec\zeta), \qquad \tau = -\frac{d_n}{v_n}. \qquad (5.1)$$
After computing all M flight times, one looks for the smallest positive one, $\tau_{\min} = \min_{\{\nu\}} \tau > 0$. The collision will take place on the corresponding hyperplane. The new point $\vec W'$ and the new direction $\vec v'$ are calculated as

$$\vec W' = \vec W + \tau_{\min}\,\vec v, \qquad (5.2)$$
$$\vec v' = \vec v - 2 v_n \vec\zeta. \qquad (5.3)$$
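The basic step condenses into a short sketch. The function name and the escape handling via tau_max are assumptions; the rows of zeta are assumed to be normalized to unit length so that equation 5.3 is a proper reflection.

    import numpy as np

    def bounce(W, v, zeta, tau_max=1e6):
        """One billiard step (equations 5.1-5.3). `zeta` is the M x (N+1)
        array of unit-normalized example vectors. Returns None on escape
        (minimal flight time exceeds tau_max), in which case a new
        trajectory is started from the MSP point."""
        d_n = zeta @ W                   # distances along the normals (eq. 5.1)
        v_n = zeta @ v                   # velocity components along the normals
        with np.errstate(divide="ignore"):
            tau = -d_n / v_n             # flight times to all M hyperplanes
        tau[tau <= 0] = np.inf           # keep only forward-in-time collisions
        nu = int(np.argmin(tau))
        if tau[nu] > tau_max:
            return None
        W_new = W + tau[nu] * v              # eq. 5.2
        v_new = v - 2.0 * v_n[nu] * zeta[nu] # eq. 5.3 (reflection)
        return W_new, v_new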
This procedure is illustrated in Figure 7a. In order to estimate the center of mass of the trajectory, one must first normalize both $\vec W$ and $\vec W'$. By assuming
Figure 6: (a) The Perceptron with maximal stability in example space. (b) The solution polyhedron (only the bounding examples A, B, and C are shown). See text for details.
Figure 7: Bouncing in version space. (a) Euclidean geometry. (b) Spherical geometry.
a constant line density, one assigns to the (normalized!) center of the segment, $(\vec W' + \vec W)/2$, the length of the vector $\vec W' - \vec W$. This is then added to the actual center of mass, as when adding two parallel forces of different magnitudes. In high dimensions (N > 5), however, the difference between the mass of the full (N + 1)-dimensional solution polyhedron and the mass of the bounding N-dimensional boundaries becomes negligible. Hence, we could just as well record the collision points, assign them the same mass density, and construct their average. Note that by continuing the trajectory beyond the first collision plane, one can also sample regions of the solution space where the network $\vec W$ makes one, two, and more mistakes (the number of mistakes equals the number of crossed boundary planes). This additional information can be used for taking an optimal decision when the examples are noisy (Opper and Haussler 1992). Since the polyhedral cone is open, the implementation of this algorithm must take into account the possibility that the trajectory might escape to infinity. The minimal flight time then becomes very large: τ > τ_max. When this exception is detected, a new trajectory is started from the MSP point in yet another random direction. Hence, from a practical point of view, the polyhedral solution cone is closed by a spherical shell with radius τ_max acting as a special "scatterer." This flipper procedure is iterated until enough data are gathered.
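A minimal sketch of this weighted accumulation, under the assumptions above (segment endpoints already collected, midpoints normalized back onto the unit shell), might look as follows; the names are hypothetical.

    import numpy as np

    def trajectory_center(segments):
        """Center-of-mass estimate from trajectory segments (W, W_next):
        each normalized midpoint is weighted by the segment length,
        as when adding parallel forces of different magnitudes."""
        center, total = 0.0, 0.0
        for W, W_next in segments:
            mid = (W + W_next) / 2.0
            mid = mid / np.linalg.norm(mid)   # normalize the midpoint
            weight = np.linalg.norm(W_next - W)
            center = center + weight * mid
            total += weight
        return center / total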
If we are class conscious and want to remain in the billiard club, we must do a bit more. The solution polyhedral cone can be closed by normalizing the version space vectors. The billiard is now defined on a curved space. However, the same strategy works here as well if between subsequent collisions one follows geodesics instead of straight lines. Figure 7b illustrates the change in direction for a small time step, leading to the well-known geodesic differential equations on the unit sphere:

$$\dot{\vec W} = \vec v, \qquad (5.4)$$
$$\dot{\vec v} = -\vec W. \qquad (5.5)$$
The solution of these equations costs additional resources. Actually, solving the differential equations is strictly necessary only when there are no bounding planes on the actual horizon (assuming light travels along Euclidean straight lines). Once one or more boundaries are "visible," the choice of the shortest flight time can be evaluated directly, since in the two geometries the flight time is monotonically deformed. Even so, the flipper procedure is obviously faster. Both variants deliver interesting additional information, like the mean escape time of a trajectory or the number of times a given border plane has been bounced on. The collision frequency classifies the training examples according to their "surface" area in the solution polyhedron, a good measure of their relative "importance."

6 Results and Performance

This section contains the results of numerical experiments performed to test the billiard algorithm. First, the ball is placed inside the billiard with the MSP algorithm (as described in Ruján 1993). Next, a number of typically O(N²) collision points are generated. Since the computation of one collision point requires M scalar products of N-dimensional vectors, the total load of this algorithm is O(MN³). The choice of N² collision points is somewhat arbitrary and is based on the following considerations. By using the billiard method, one generates many collision points lying on the borders of the solution polyhedron. We could try to use this information for approximating the solution polyhedron with an ellipsoidal cone. The number of free parameters involved in the fit is of the order O(N²). Hence, at least a constant times that many points are needed. Such a fitted ellipsoid also delivers an estimate of the decision uncertainty. If one is not interested in this information, it is enough to monitor how one or more projections of the center-of-mass estimate change as the number of collisions increases. Once these projections become stable, the program can be stopped. T. Watkin (1993) argues that the number of sampling points should be of O(1). I find his arguments unconvincing. For example, Figure 8 shows how
Figure 8: Average of the generalization probability as a function of the number of collisions, N = 100, M = 1000.
the average generalization probability changes as the number of collisions increases up to N² steps. Finding a point within the solution polyhedron is algorithmically equivalent to solving a linear programming problem. Therefore, his method of sampling the version space with randomly started AdaTron gradient descent (or any other Perceptron learning method) requires at least O(N²M) operations per sampling point, compared to O(NM) per collision in the billiard method. In a first test, a known "teacher" Perceptron $\vec T$ was used to label the randomly generated examples for training. The generalization probability G(α), α = M/N, was then computed by measuring the overlap between the resulting solution ("student") Perceptron and the teacher Perceptron:

$$G(\alpha) = 1 - \frac{1}{\pi}\cos^{-1}(\vec T, \vec w); \qquad (\vec T, \vec T) = (\vec w, \vec w) = 1. \qquad (6.1)$$
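For completeness, equation 6.1 translates directly into a few lines; the clipping guards against rounding errors and is an implementation detail, not part of the original formula.

    import numpy as np

    def generalization_probability(T, w):
        """Equation 6.1. Both the teacher T and the student w are
        assumed normalized: (T, T) = (w, w) = 1."""
        overlap = np.clip(np.dot(T, w), -1.0, 1.0)
        return 1.0 - np.arccos(overlap) / np.pi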
The numerical results obtained from 10 different realizations for N = 100 are compared with the theoretical Bayes learning curve in Figure 9. Figure 10 shows a comparison between the billiard results and the MSP results. Although the differences seem small compared to the error bars, the billiard solution was in all realizations consistently superior to the MSP. Figure 11 shows how the number of constraints (examples) bordering the solution polyhedron changes with increasing α = M/N.
Figure 9: The theoretical Bayes learning curve (solid line) versus billiard results obtained in 10 independent trials. G(α) is the generalization probability, α = M/N, N = 100.
As the number of examples increases, the probability of escape from the solution polyhedron decreases, and the network reaches its storage capacity. Figure 12 shows the average number of collisions before escape as a function of the classification error, parameterized through α = M/N. Therefore, by measuring either the escape rate or the number of "active" examples, we can estimate the generalization error without using test examples. Note, however, that such calibration graphs should be calculated for each specific distribution of the input vectors. Randomly generated training examples lead to rather isotropic polyhedra, as illustrated by the small difference between the Bayes and the MSP learning curves (see Fig. 10). Therefore, we expect that the billiard approach leads to bigger improvements when applied to real-life problems with strongly anisotropic solution polyhedra. Similar improvements can be expected when using constructive methods for multilayer Perceptrons that iteratively use the Perceptron algorithm (Marchand et al. 1989). Such procedures have been used, for example, for classifying handwritten digits (Knerr et al. 1990). For all such example sets available to me, the introduction of the billiard algorithm on top of the MSP leads to consistent improvements of up to 5% in classification probability.
Figure 10: Average generalization probability, same parameters as in Figure 9. Lower curve: the maximal stability Perceptron algorithm. Upper curve: the billiard algorithm.
A publicly available data set known as the sonar problem (Gorman and Sejnowski 1988; Fahlman N.d.) considers the problem of deciding between rocks and mines from sonar data. The input space is N = 60 dimensional, and the whole set consists of 111 mine and 97 rock examples. We go down the list of rock signals putting alternate members into the training and test sets; we do the same with the set of mine signals. Since the data sets are sorted by increasing azimuth, this gives us training and test sets of equal size (104 signals each) with the population of azimuth angles matched as closely as possible. Gorman and Sejnowski (1988) consider a three-layer feedforward neural network architecture with different numbers of hidden units, trained by the backpropagation algorithm. Their results are summarized and compared to ours in Table 2. By applying the MSP algorithm, we first found that the whole data (training plus test) set is linearly separable. Second, by using the MSP on the training set we obtain a 77.5% classification rate on the test set (compared to 73.1% in Gorman and Sejnowski 1988). Playing billiards leads to an 83.4% classification rate (in both cases, the training set was faultlessly classified). This amount of improvement is also typical for other applications.
Figure 11: Number of different identified polyhedral borders versus G(α), N = 100. Diamonds: maximal stability Perceptron, saturating at 101. Crosses: billiard algorithm.
Figure 12: Mean number of collisions before escaping the solution polyhedral cone versus G(α), N = 100.
Table 2: Results for the Sonar Classification Problem from Gorman and Sejnowski (1988).

Hidden units   % right on training set   Standard deviation   % right on test set   Standard deviation
0              79.3                      3.4                  73.1                  4.8
0–MSP          100.0                     —                    77.5                  —
0–Billiard     100.0                     —                    83.4                  —
2              96.2                      2.2                  85.7                  6.3
3              98.1                      1.5                  87.6                  3.0
6              99.4                      0.9                  89.3                  2.4
12             99.8                      0.6                  90.4                  1.8
24             100.0                     0.0                  89.2                  1.4

Note: 0–MSP is the maximal stability Perceptron; 0–Billiard is the Bayes billiard estimate.
The number of active examples (those contributing to the solution) was 42 for the MSP and 55 during the billiard. By computing the MSP and the Bayes Perceptron, we did not use any information available on the test set. On the contrary, many networks trained by backpropagation are slightly adjusted to the test set by changing network parameters: output unit biases and/or activation function decision bounds. Other training protocols also allow such adjustments or generate a population of networks, from which a "best" one is chosen based on test set results. Although such adaptive behavior might be advantageous in many practical applications, it can be misleading when trying to infer the real capability of the trained network. In the sonar problem, for instance, we know that a set of Perceptrons separating faultlessly the whole data (training plus test) set is included in the version space. Hence, one could use the billiard or another method to find it. This would be an extreme example of "adapting" our solution to the test set. Such a procedure is especially dangerous when the test set is not "typical." Since in the sonar problem the data were divided into two equal sets, by exchanging the roles of the training and test sets one would expect results of similar quality. However, in this case, much weaker results (a 73.3% classification rate) are obtained. This shows that the two sets do not contain the same amount of information about the common Perceptron solution. Looking at the results of Table 2, it is hard to understand how 104 training examples could substantiate the excellent average test set performance of 90.4 ± 1.8% for networks with 12 hidden units (841 free parameters). The method of
structural risk minimization (Vapnik 1982) uses uniform bounds for the generalization error in networks with different Vapnik-Chervonenkis (1971) dimensions. To establish a classification probability of 90%, one needs about 10 times more training examples for the linearly separable class of functions. A network with 12 hidden units certainly has a much larger capacity and requires correspondingly more examples. One could argue that each of the 12 hidden units has solved the problem on its own, so that the network acts as a committee machine. However, such a majority decision should be at best comparable to the Bayes estimate.
7 Conclusions and Prospects

The study of dynamical systems has led to many interesting practical applications in time-series analysis, coding, and chaos control. The elementary application of Hamiltonian dynamics presented in this paper demonstrates that the center of mass of a long dynamic trajectory bouncing back and forth between the walls of the convex polyhedral solution cone leads to a good estimate of the Bayes decision rule for linearly separable problems. Somewhat similar ideas have recently been applied to constrained nonlinear programming (Coleman and Li 1994). Although the geometric view presented in this paper overemphasizes the role played by extremal (active) vertices of the example set and by faultlessly trained Perceptrons, it is not difficult to make these estimates more robust. For example, some of the extremal examples can be removed to test, and improve, the stability of the MSP. Similarly, the solution polyhedron can be expanded to include solutions with nonzero training errors, allowing for learning noisy examples. Since the Perceptron problem is equivalent to the linear inequalities problem (equation 4.1), it has the same algorithmic complexity as linear programming. The theory of convex polyhedra plays a central role in both mathematical programming and solving NP-hard problems such as the traveling salesman problem (Lawler and Lenstra 1984; Padberg and Rinaldi 1987). Viewed from this perspective, the ergodic theory of convex polyhedral billiards might provide new, effective tools for solving different combinatorial optimization problems (Ruján, in preparation). A big advantage of such ray-tracing algorithms is that they can be run in parallel by following several trajectories at the same time. The success of further applications depends on methods of making such simple dynamics strongly mixing. On the theoretical side, more general results, applicable to large classes of convex polyhedral billiards, are called for. In particular, a good estimate of the average escape (or typical mixing) time is needed in order to bound the average behavior of future "ergodic" algorithms.
Acknowledgments

I thank Manfred Opper for the analytic Bayes data and Bruno Eckhardt for discussions on billiards. For a LaTeX original of this article with PostScript figures and a C implementation of the billiard algorithm, point your browser at http://www.icbm.uni-oldenburg.de/~rujan/rujan.html.

References

Anlauf, J., and Biehl, M. 1989. The AdaTron: An adaptive Perceptron algorithm. Europhysics Letters 10, 687.
Bouten, M., Schietse, J., and Van den Broeck, C. 1995. Gradient descent learning in Perceptrons: A review of possibilities. Physical Review E 52, 1958.
Bunimovich, L. A. 1979. On the ergodic properties of nowhere dispersing billiards. Communications in Mathematical Physics 65, 295.
Bunimovich, L. A., and Sinai, Y. G. 1980. Markov partitions for dispersed billiards. Communications in Mathematical Physics 78, 247.
Coleman, T. F., and Li, Y. 1994. On the convergence of interior-reflective Newton methods for nonlinear minimization subject to bounds. Mathematical Programming 67, 189.
Diederich, S., and Opper, M. 1987. Learning of correlated patterns in spin-glass networks by local learning rules. Physical Review Letters 58, 949.
Fahlman, S. E. N.d. Collection of neural network data. Included in the public domain software am-6.0.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Khachiyan, L. G., and Todd, M. J. 1993. On the complexity of approximating the maximal inscribed ellipsoid for a polytope. Mathematical Programming 61, 137.
Knerr, S., Personnaz, L., and Dreyfus, G. 1990. Single layer learning revisited: A stepwise procedure for building and training a neural network. In Neurocomputing, F. Fogelman and J. Hérault, eds. Springer-Verlag, Berlin.
Krishnamurthy, H. R., Mani, H. S., and Verma, H. C. 1982. Exact solution of the Schrödinger equation for a particle in a tetrahedral box. Journal of Physics A 15, 2131.
Kuttler, J. R., and Sigillito, V. G. 1984. Eigenvalues of the Laplacian in two dimensions. SIAM Review 26, 163.
Lampert, P. F. 1969. Designing pattern categories with extremal paradigm information. In Methodologies of Pattern Recognition, M. S. Watanabe, ed., p. 359. Academic Press, New York.
Lawler, E. L., Lenstra, J. K., Rinnoy Kan, A. H. G., and Shmoys, D. B., eds. 1984. The Traveling Salesman Problem. John Wiley, New York.
Marchand, M., Golea, M., and Ruján, P. 1989. Convergence theorem for sequential learning in two layer perceptrons. Europhysics Letters 11, 487.
Moser, J. 1980. Geometry of quadrics and spectral theory. In The Chern Symposium, Y. Hsiang et al., eds., p. 147. Springer-Verlag, Berlin.
Opper, M. 1993. Bayesian learning. Talk at Neural Networks for Physicists 3, Minneapolis, MN.
Opper, M., and Haussler, D. 1991. Generalization performance of Bayes optimal classification algorithm for learning a Perceptron. Physical Review Letters 66, 2677.
Opper, M., and Haussler, D. 1992. Calculation of the learning curve of Bayes optimal classification algorithm for learning a Perceptron with noise. Contribution to the IVth Annual Workshop on Computational Learning Theory (COLT91), Santa Cruz 1991, pp. 61–87. Morgan Kaufmann, San Mateo, CA.
Padberg, M., and Rinaldi, G. 1987. Optimization of a 532-city symmetric traveling salesman problem by branch and cut. Operations Research Letters 6, 1.
Padberg, M., and Rinaldi, G. 1988. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. IASI preprint R. 247.
Pólya, G. 1968. Mathematics and Plausible Reasoning. Vol. 2: Patterns of Plausible Inference. 2d ed. Princeton University Press, Princeton, NJ.
Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65, 386.
Ruján, P. 1993. A fast method for calculating the Perceptron with maximal stability. Journal de Physique (Paris) I 3, 277.
Ruján, P. In preparation. Ergodic billiards and combinatorial optimization.
Sonnevend, Gy. 1988. New algorithms based on a notion of "centre" (for systems of analytic inequalities) and on rational extrapolation. In Trends in Mathematical Optimization: Proceedings of the 4th French-German Conference on Optimization, Irsee 1986, K. H. Hoffmann, J.-B. Hiriart-Urruty, C. Lemaréchal, and J. Zowe, eds., ISNM vol. 84. Birkhäuser, Basel.
Sutherland, B. 1985. An introduction to the Bethe Ansatz. In Exactly Solvable Problems in Condensed Matter and Relativistic Field Theory, B. S. Shastry, S. S. Jha, and V. Singh, eds., p. 1. Lecture Notes in Physics 242. Springer-Verlag, Berlin.
Vaidya, P. M. 1990. A new algorithm for minimizing a convex function over convex sets. In Proceedings of the 30th Annual FOCS Symposium, Research Triangle Park, NC, 1989, pp. 338–343. IEEE Computer Society Press, Los Alamitos, CA.
Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, Berlin.
Vapnik, V. N., and Chervonenkis, A. Y. 1971. Theory of Probability and Its Applications 16, 264.
Watkin, T. 1993. Optimal learning with a neural network. Europhysics Letters 21, 871.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. 1960 IRE WESCON Convention Record. IRE, New York.
Received August 14, 1995; accepted April 24, 1996.
Communicated by Raymond Watrous
Partial BFGS Update and Efficient Step-Length Calculation for Three-Layer Neural Networks Kazumi Saito Ryohei Nakano NTT Communication Science Laboratories, 2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02 Japan
Second-order learning algorithms based on quasi-Newton methods have two problems. First, standard quasi-Newton methods are impractical for large-scale problems because they require N² storage space to maintain an approximation to an inverse Hessian matrix (N is the number of weights). Second, a line search to calculate a reasonably accurate step length is indispensable for these algorithms. In order to provide desirable performance, an efficient and reasonably accurate line search is needed. To overcome these problems, we propose a new second-order learning algorithm. The descent direction is calculated on the basis of a partial Broyden-Fletcher-Goldfarb-Shanno (BFGS) update with 2Ns memory space (s ≪ N), and a reasonably accurate step length is efficiently calculated as the minimal point of a second-order approximation to the objective function with respect to the step length. Our experiments, which use a parity problem and a speech synthesis problem, have shown that the proposed algorithm outperformed major learning algorithms. Moreover, it turned out that an efficient and accurate step-length calculation plays an important role in the convergence of quasi-Newton algorithms, and a partial BFGS update greatly saves storage space without losing convergence performance.

1 Introduction

The backpropagation (BP) algorithm (Rumelhart et al. 1986) has been applied to various classes of problems, and its usefulness has been proved. However, even with a momentum term, this algorithm often requires a large number of iterations for convergence. Moreover, the user is required to determine appropriate parameters by trial and error. To overcome these drawbacks, a learning rate maximization method (LeCun et al. 1993), learning rate adaptation rules (Jacobs 1988; Silva and Almeida 1990), smart algorithms such as QuickProp (Fahlman 1988) and RPROP (Riedmiller and Braun 1993), and second-order learning algorithms (Watrous 1987; Barnard 1992; Battiti 1992; Møller 1993b) based on nonlinear optimization techniques (Gill et al. 1981) have been proposed. Each has achieved a certain degree of
success. Among these approaches, we believe that second-order learning algorithms should be investigated further because they theoretically have excellent convergence properties (Gill et al. 1981). At present, however, they have two problems. One is scalability. Second-order algorithms based on Levenberg-Marquardt or quasi-Newton methods cannot suitably scale up to large problems. Levenberg-Marquardt algorithms generally require a large amount of computation until convergence, even for midscale problems that involve many hundreds of weights. They require O(N²m) operations to calculate the descent direction during any one iteration; N denotes the number of weights and m the number of examples. Standard quasi-Newton algorithms (Watrous 1987; Barnard 1992) are rarely applied to large-scale problems that involve more than many thousands of weights because they require N² storage space to maintain an approximation to an inverse Hessian matrix. In order to cope with this problem, the OSS algorithm (Battiti 1989) adopted the memoryless update (Gill et al. 1981); however, OSS may not provide desirable performance because the descent direction is calculated on the basis of a fast but rough approximation. The other problem is the burden of step-length computation. A line search to calculate an adequate step length is indispensable for second-order algorithms that are based on quasi-Newton or conjugate gradient methods. Since an inaccurate line search may yield undesirable performance, a reasonably accurate line search is needed. However, an exact line search based on existing methods generally requires a nonnegligible computational load, at least a few function and/or gradient evaluations. Since it is widely recognized that successful convergence of conjugate gradient methods relies heavily on the accuracy of the line search, it seems rather difficult to increase the speed of conjugate gradient algorithms. In the SCG algorithm (Møller 1993b), although the step length is estimated by using the model trust region approach (Gill et al. 1981), this method can be regarded as an efficient one-step line search based on a one-sided difference equation. Fortunately, successful convergence of quasi-Newton algorithms is theoretically guaranteed even with an inaccurate line search if certain conditions are satisfied (Powell 1976), but efficient convergence cannot always be expected. Thus, we believe that if the step length can be calculated with greater efficiency and reasonable accuracy, quasi-Newton algorithms will work better in two respects: processing efficiency and convergence. For large-scale problems that include a large number of redundant training examples, on-line algorithms (LeCun et al. 1993), which perform a weight update for each single example, will work with greater efficiency than their off-line counterparts. Although second-order algorithms are basically not suited to on-line updating, waiting for a sweep of many examples before updating is not reasonable. Toward this improvement, several attempts have been made to introduce "pseudo-on-line" updating into second-order algorithms, where updates are performed on smaller subsets of data (Møller
1993c; Kuhn and Herzberg 1990). Thus, if a better second-order algorithm is developed, its pseudo-on-line version will work more efficiently. This paper is organized as follows. Section 2 describes a new second-order learning algorithm based on a quasi-Newton method, called BPQ, where the descent direction is calculated on the basis of a partial BFGS update and a reasonably accurate step length is efficiently calculated as the minimal point of a second-order approximation. Section 3 evaluates BPQ's performance in comparison with other major learning algorithms.

2 BPQ Algorithm

2.1 The Problem. Let $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ be a set of examples, where $x_t$ denotes an n-dimensional input vector and $y_t$ a target value corresponding to $x_t$. In a three-layer neural network, let h be the number of hidden units, $w_i$ (i = 1, . . . , h) the weight vector between all the input units and the hidden unit i, and $w_0 = (w_{00}, \ldots, w_{0h})^T$ the weight vector between all the hidden units and the output unit; $w_{i0}$ means a bias term, and $x_{t0}$ is set to 1. Note that $a^T$ denotes the transposed vector of a. Hereafter, a vector consisting of all parameters, $(w_0^T, \ldots, w_h^T)^T$, is simply expressed as Φ; let N (= nh + 2h + 1) be the dimension of Φ. Then, learning in the three-layer neural network can be defined as the problem of minimizing the following objective function:

$$f(\Phi) = \frac{1}{2}\sum_{t=1}^{m} (y_t - z_t)^2, \qquad (2.1)$$
where $z_t = z(x_t; \Phi) = w_{00} + \sum_{i=1}^{h} w_{0i}\,\sigma(w_i^T x_t)$, and $\sigma(u) = 1/(1 + e^{-u})$ is a sigmoidal function. Note that we do not employ a nonlinear transformation at the output unit because it is not essential from the viewpoint of function approximation.

2.2 Quasi-Newton Method. A second-order Taylor expansion of $f(\Phi + \Delta\Phi)$ about $\Delta\Phi$ is given as $f(\Phi) + (\nabla f(\Phi))^T \Delta\Phi + \frac{1}{2}(\Delta\Phi)^T \nabla^2 f(\Phi)\,\Delta\Phi$. If $\nabla^2 f(\Phi)$ is positive definite, the minimal point of this expansion is given by $\Delta\Phi = -(\nabla^2 f(\Phi))^{-1} \nabla f(\Phi)$. Newton techniques minimize the objective function f(Φ) by iteratively calculating the descent direction ΔΦ (Gill et al. 1981). However, since O(N³) operations are required to calculate $(\nabla^2 f(\Phi))^{-1}$ directly, we cannot expect these techniques to scale suitably to large problems. Quasi-Newton techniques, on the other hand, calculate a matrix H through iterations in order to approximate $(\nabla^2 f(\Phi))^{-1}$. The basic algorithm is described as follows (Gill et al. 1981):

Step 1: Initialize $\Phi_1$, set $H_1 = I$ (I: identity matrix), and set k = 1.
Step 2: Calculate the current descent direction: $\Delta\Phi_k = -H_k g_k$, where $g_k = \nabla f(\Phi_k)$.
Step 3: Terminate the iteration if a stopping criterion is satisfied.
Step 4: Calculate the step length $\lambda_k$ that minimizes $f(\Phi_k + \lambda\Delta\Phi_k)$.
Step 5: Update the weights: $\Phi_{k+1} = \Phi_k + \lambda_k \Delta\Phi_k$.
Step 6: If $k \equiv 0 \pmod N$, set $H_{k+1} = I$; otherwise, update $H_{k+1}$.
Step 7: Set k = k + 1, and return to Step 2.
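In code, the skeleton of Steps 1-7 might look as follows; f, grad, update_h, and line_search are placeholder callables, and the stopping tolerance is an assumption made for this sketch.

    import numpy as np

    def quasi_newton(f, grad, phi, update_h, line_search,
                     tol=1e-8, max_iter=10000):
        """Steps 1-7 as a loop. `update_h` implements an inverse-Hessian
        update such as equation 2.2; `line_search` implements Step 4."""
        N = phi.size
        H = np.eye(N)                               # Step 1
        for k in range(1, max_iter + 1):
            g = grad(phi)
            if np.linalg.norm(g) < tol:             # Step 3
                break
            d = -H @ g                              # Step 2
            lam = line_search(f, grad, phi, d)      # Step 4
            phi_next = phi + lam * d                # Step 5
            if k % N == 0:                          # Step 6: periodic restart
                H = np.eye(N)
            else:
                H = update_h(H, lam * d, grad(phi_next) - g)
            phi = phi_next                          # Step 7
        return phi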
2.3 Existing Methods for Calculating Descent Directions. Several methods for updating $H_{k+1}$ have been proposed. Among them, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update (Fletcher 1980) was the most successful in a number of studies. By putting $p_k = \lambda_k \Delta\Phi_k$ and $q_k = g_{k+1} - g_k$, the BFGS formula is given as

$$H_{k+1} = H_k - \frac{p_k q_k^T H_k + H_k q_k p_k^T}{p_k^T q_k} + \left(1 + \frac{q_k^T H_k q_k}{p_k^T q_k}\right)\frac{p_k p_k^T}{p_k^T q_k}. \qquad (2.2)$$
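Equation 2.2 translates directly into a small function; the argument names follow the definitions above, and it could serve as the update rule in the quasi-Newton skeleton sketched in Section 2.2.

    import numpy as np

    def bfgs_update(H, p, q):
        """The BFGS inverse-Hessian update of equation 2.2, with
        p = lambda_k * dPhi_k and q = g_{k+1} - g_k."""
        pq = p @ q                    # p^T q
        Hq = H @ q                    # H_k q_k (H is symmetric)
        return (H
                - (np.outer(p, Hq) + np.outer(Hq, p)) / pq
                + (1.0 + (q @ Hq) / pq) * np.outer(p, p) / pq)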
In large-scale problems that involve more than many thousands of weights, maintaining the approximation matrix H becomes impractical because it requires N² storage space. In order to cope with this problem, the OSS (one-step secant) algorithm (Battiti 1989) adopted the memoryless BFGS update (Gill et al. 1981). By always taking the previous matrix $H_k$ as the identity matrix ($H_k = I$), the descent direction of Step 2 is calculated as

$$\Delta\Phi_{k+1} = -g_{k+1} + \frac{p_k q_k^T g_{k+1} + q_k p_k^T g_{k+1}}{p_k^T q_k} - \left(1 + \frac{q_k^T q_k}{p_k^T q_k}\right)\frac{p_k p_k^T g_{k+1}}{p_k^T q_k}. \qquad (2.3)$$

Clearly, equation 2.3 can be calculated with O(N) multiplications and storage space. However, OSS may not provide desirable performance because the descent direction is calculated on the basis of the above substitution. In our early experiments, this method worked rather poorly in comparison with the partial BFGS update proposed in the next section.

2.4 New Method for Calculating Descent Directions. In this section, we propose a partial BFGS update with 2Ns memory (s ≪ N), where the search directions are exactly equivalent to those of the original BFGS update during the first s + 1 iterations. The partiality parameter s means the length of the history described below. By putting $r_k = H_k q_k$, we have

$$H_k g_{k+1} = H_k q_k + H_k g_k = r_k - \frac{p_k}{\lambda_k}.$$

Then, the descent direction based on equation 2.2 can be calculated as
$$\Delta\Phi_{k+1} = -H_{k+1} g_{k+1} = -r_k + \frac{p_k}{\lambda_k} + \frac{p_k r_k^T g_{k+1} + r_k p_k^T g_{k+1}}{p_k^T q_k} - \left(1 + \frac{q_k^T r_k}{p_k^T q_k}\right)\frac{p_k p_k^T g_{k+1}}{p_k^T q_k}. \qquad (2.4)$$
After calculating $r_k$, equation 2.4 can be evaluated using O(N) multiplications and storage space, just like the memoryless BFGS update. Now we show that it is possible to calculate $r_k$ within O(Ns) multiplications and 2Ns storage space by using the partial BFGS update. Here, we assume k ≤ s. When k = 1, $r_1$ (= $H_1 q_1 = g_2 - g_1$) can be calculated by subtractions alone. When k > 1, we assume that each of $r_1, \ldots, r_{k-1}$ has been calculated and stored. Note that for i < k,

$$\alpha_i = \frac{1}{p_i^T q_i} \quad\text{and}\quad \beta_i = \alpha_i (1 + \alpha_i q_i^T r_i)$$

have already been calculated during the iterations. Thus, by recursively applying equation 2.2, $r_k$ can be calculated with O(Nk) multiplications and 2Nk storage space, as follows:

$$r_k = H_k q_k = H_{k-1} q_k - \alpha_{k-1} p_{k-1} r_{k-1}^T q_k - \alpha_{k-1} r_{k-1} p_{k-1}^T q_k + \beta_{k-1} p_{k-1} p_{k-1}^T q_k$$
$$= q_k + \sum_{i=1}^{k-1}\left(-\alpha_i p_i r_i^T q_k - \alpha_i r_i p_i^T q_k + \beta_i p_i p_i^T q_k\right). \qquad (2.5)$$
Next, when the number of iterations exceeds s + 1, we have two alternatives: restarting the update by discarding the accumulated vectors or continuing the update by using the latest vectors. In both cases, equation 2.5 can be calculated within O(Ns) multiplications and 2Ns storage space. Therefore, it has been shown that equation 2.4 can be calculated within O(Ns) multiplications and 2Ns storage space. In our experiments, the former update was employed because the latter worked poorly when s was small. Although the idea of partial quasi-Newton methods has been briefly described before (Luenberger 1984), strictly speaking, our update is different. The earlier proposal intended to store the vectors $p_i$ and $q_i$, but our update stores the vectors $p_i$ and $r_i$. The immediate advantage of this partial update over the original BFGS update is its applicability to large-scale problems. Even if N is very large, by setting s to an adequately small value with respect to the amount of available storage space, our update will work. On the other hand, the probable advantage over the memoryless BFGS update is its superior convergence property. However, this claim should be examined through a wide range of experiments. Note that when s = 0, the partial BFGS update always gives the gradient direction; when s = 1, it corresponds to the memoryless BFGS update.
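A minimal sketch of equations 2.4 and 2.5 follows, using the restarting variant employed in the experiments. The history layout (p_i, r_i, α_i, β_i) follows the text, while the function names and argument order are assumptions.

    import numpy as np

    def r_vector(q, history):
        """Equation 2.5: r_k = H_k q_k, accumulated from the stored tuples
        (p_i, r_i, alpha_i, beta_i), i < k, starting from H_1 = I."""
        r = q.copy()
        for p_i, r_i, a_i, b_i in history:
            r += (-a_i * (r_i @ q)) * p_i \
                 + (-a_i * (p_i @ q)) * r_i \
                 + (b_i * (p_i @ q)) * p_i
        return r

    def partial_bfgs_direction(g_next, p, q, lam, history, s):
        """Equation 2.4: the next descent direction -H_{k+1} g_{k+1}
        in O(Ns) work and 2Ns storage. The history is restarted
        (cleared) once it holds s entries."""
        if len(history) >= s:
            history.clear()               # restart: discard accumulated vectors
        r = r_vector(q, history)
        a = 1.0 / (p @ q)                 # alpha_k
        b = a * (1.0 + a * (q @ r))       # beta_k
        history.append((p, r, a, b))
        return (-r + p / lam
                + a * ((r @ g_next) * p + (p @ g_next) * r)
                - b * (p @ g_next) * p)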
2.5 Existing Methods for Calculating Step Lengths. In Step 4, since λ is the only variable in f, we can express $f(\Phi + \lambda\Delta\Phi)$ simply as ζ(λ). Calculating the λ that minimizes ζ(λ) is called a line search. Among a number of possible line search methods, we considered the following three typical ones: a fast but inaccurate one, a moderately accurate one, and an exact but slow one. The first method is explained below. By using quadratic interpolation (Gill et al. 1981; Battiti 1989), we can derive a fast line search method that only guarantees ζ(λ) < ζ(0). Let $\lambda_1$ be an initial value for the step length. If $\zeta(\lambda_1) < \zeta(0)$, $\lambda_1$ becomes the resultant step length; otherwise, by considering a quadratic function h(λ) that satisfies the conditions $h(0) = \zeta(0)$, $h(\lambda_1) = \zeta(\lambda_1)$, and $h'(0) = \zeta'(0)$, we get the following approximation of ζ(λ):

$$\zeta(\lambda) \approx h(\lambda) = \zeta(0) + \zeta'(0)\lambda + \frac{\zeta(\lambda_1) - \zeta(0) - \zeta'(0)\lambda_1}{\lambda_1^2}\,\lambda^2.$$
Since $\zeta(\lambda_1) \geq \zeta(0)$ and $\zeta'(0) < 0$, the minimal point of h(λ) is given by

$$\lambda_2 = -\frac{\zeta'(0)\,\lambda_1^2}{2(\zeta(\lambda_1) - \zeta(0) - \zeta'(0)\lambda_1)}. \qquad (2.6)$$
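As a sketch, this fast line search simply iterates equation 2.6; zeta is a placeholder callable evaluating ζ(λ), and zeta_prime0 is ζ′(0) < 0, both assumed supplied by the caller.

    def fast_line_search(zeta, zeta_prime0, lam=1.0):
        """Fast but inaccurate line search: repeat the quadratic
        interpolation of equation 2.6 until zeta(lam) < zeta(0)."""
        z0 = zeta(0.0)
        zl = zeta(lam)
        while zl >= z0:
            # minimal point of the interpolating quadratic h (eq. 2.6);
            # equation 2.6 guarantees 0 < new lam < old lam
            lam = -zeta_prime0 * lam * lam / (2.0 * (zl - z0 - zeta_prime0 * lam))
            zl = zeta(lam)
        return lam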
Note that $0 < \lambda_2 < \lambda_1$ is guaranteed by equation 2.6. Thus, by iterating this process until $\zeta(\lambda_\nu) < \zeta(0)$, we can always find a $\lambda_\nu$ that satisfies $\zeta(\lambda_\nu) < \zeta(0)$, where ν denotes the number of iterations. Here, the initial value $\lambda_1$ is set to 1 because the optimal step length is near 1 when H closely approximates $(\nabla^2 f(\Phi))^{-1}$. Hereafter, a quasi-Newton algorithm based on the original or partial BFGS update in combination with this fast but inaccurate line search is called BFGS1. By using quadratic extrapolation (Fletcher 1980), we can derive a moderately accurate line search method that guarantees $\zeta'(\lambda_\nu) > \gamma_1 \zeta'(\lambda_{\nu-1})$ as well as $\zeta(\lambda_\nu) < \zeta(\lambda_{\nu-1})$, where $\gamma_1$ is a small constant (e.g., 0.1). Namely, when $\lambda_\nu$ does not satisfy the stopping criterion, if $\zeta(\lambda_\nu) \geq \zeta(\lambda_{\nu-1})$, $\lambda_{\nu+1}$ is calculated using equation 2.6; otherwise, by considering an extrapolation of the slopes $\zeta'(\lambda_{\nu-1})$ and $\zeta'(\lambda_\nu)$, the appropriate expression for the minimizing value $\lambda_{\nu+1}$ is

$$\lambda_{\nu+1} = \lambda_\nu - \zeta'(\lambda_\nu)\,\frac{\lambda_\nu - \lambda_{\nu-1}}{\zeta'(\lambda_\nu) - \zeta'(\lambda_{\nu-1})}. \qquad (2.7)$$
However, if $\zeta'(\lambda_\nu) \leq \zeta'(\lambda_{\nu-1})$, then $\lambda_{\nu+1}$ does not exist; thus, $\lambda_{\nu+1}$ is set to $\lambda_{\nu-1} + \gamma_2(\lambda_\nu - \lambda_{\nu-1})$, where $\gamma_2$ is an adequate value (e.g., 9). In this method, $\lambda_0$ is set to 0, and $\lambda_1$ is estimated as $\min(1, -2\zeta'(0)^{-1}(f(\Phi_{\mathrm{current}}) - f(\Phi_{\mathrm{previous}})))$. Note that using this estimate of $\lambda_1$ for the first method is not a good strategy, because the first method has no extrapolation process and $\lambda_1$ can be a very small value. Hereafter, a quasi-Newton algorithm based on the original or partial
BFGS update in combination with this moderately accurate line search is called BFGS2. An exact but slow line search method can be constructed by iteratively using the BFGS2 line search method until a stopping criterion is met, for example, $|\zeta'(\lambda_\nu)| < 10^{-8}$. The difference from the BFGS2 method is that equation 2.7 is used for interpolation as well as extrapolation. Hereafter, a quasi-Newton algorithm based on the original or partial BFGS update in combination with this exact but slow line search is called BFGS3.

2.6 New Method for Calculating Step Lengths. Here we propose a new method for calculating a reasonably accurate step length λ in Step 4.

2.6.1 Basic Procedure. A second-order Taylor approximation of ζ(λ) is given as

$$\zeta(\lambda) \approx \zeta(0) + \zeta'(0)\lambda + \frac{1}{2}\zeta''(0)\lambda^2.$$

When $\zeta'(0) < 0$ and $\zeta''(0) > 0$, the minimal point of this approximation is given by

$$\lambda = -\frac{\zeta'(0)}{\zeta''(0)}\;\left(= -\frac{(\nabla f(\Phi))^T \Delta\Phi}{(\Delta\Phi)^T \nabla^2 f(\Phi)\,\Delta\Phi}\right). \qquad (2.8)$$
Other cases will be considered in the next section. For the three-layer neural networks defined by equation 2.1, we can efficiently calculate ζ′(0) and ζ″(0) as follows. By differentiating ζ(λ) and substituting 0 for λ, we obtain

$$\zeta'(0) = -\sum_{t=1}^{m}(y_t - z_t)z_t' \quad\text{and}\quad \zeta''(0) = \sum_{t=1}^{m}\left((z_t')^2 - (y_t - z_t)z_t''\right).$$

Since the derivative of $z_t = z(x_t; \Phi)$ is defined as $\frac{d}{d\lambda}z(x_t; \Phi + \lambda\Delta\Phi)\big|_{\lambda=0}$, we obtain

$$z_t' = \Delta w_{00} + \sum_{i=1}^{h}(\Delta w_{0i}\sigma_{it} + w_{0i}\sigma_{it}') \quad\text{and}\quad z_t'' = \sum_{i=1}^{h}(2\Delta w_{0i}\sigma_{it}' + w_{0i}\sigma_{it}''),$$

where $\sigma_{it} = \sigma(w_i^T x_t)$, $\sigma_{it}' = \sigma_{it}(1 - \sigma_{it})(\Delta w_i)^T x_t$, $\sigma_{it}'' = \sigma_{it}'(1 - 2\sigma_{it})(\Delta w_i)^T x_t$, and $\Delta w_{ij}$ denotes the change of $w_{ij}$ calculated in Step 2. Now we consider the computational complexity of calculating the step length using equation 2.8. Clearly, $(\Delta w_i)^T x_t$ must be calculated for each pair of hidden unit i and input $x_t$; thus, since the number of hidden units is h and the number of inputs is m, at least nhm multiplications are required. Since the order of multiplications required to calculate the remainder is O(hm), and N = nh + 2h + 1, the total complexity of the calculation is Nm + O(hm).
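As a sketch of this computation (the argument layout is an assumption, not from the paper): X carries the bias column, and the single matrix product realizing $(\Delta w_i)^T x_t$ is the nhm-multiplication term discussed above. The returned value is the step length of equation 2.8, assuming ζ′(0) < 0 and ζ″(0) > 0; the undesirable cases are handled in the next subsection.

    import numpy as np

    def bpq_step_length(X, y, w0, W, dw0, dW):
        """Step length of equation 2.8 for the three-layer net of
        equation 2.1. X is m x (n+1) with a leading column of ones;
        w0 = (w00,...,w0h) are hidden-to-output weights; W is the
        h x (n+1) input-to-hidden matrix; dw0, dW are the corresponding
        components of the search direction dPhi."""
        s = 1.0 / (1.0 + np.exp(-(X @ W.T)))    # sigma_it, shape m x h
        da = X @ dW.T                           # (dw_i)^T x_t: nhm mults, once
        s1 = s * (1.0 - s) * da                 # sigma'_it
        s2 = s1 * (1.0 - 2.0 * s) * da          # sigma''_it
        z = w0[0] + s @ w0[1:]                  # network outputs z_t
        z1 = dw0[0] + s @ dw0[1:] + s1 @ w0[1:] # z'_t
        z2 = 2.0 * (s1 @ dw0[1:]) + s2 @ w0[1:] # z''_t
        res = y - z
        zp = -(res @ z1)                        # zeta'(0)
        zpp = z1 @ z1 - res @ z2                # zeta''(0)
        return -zp / zpp                        # equation 2.8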
This algorithm can be generalized to multilayered networks with other differentiable activation functions; thus, it is also applicable to recurrent networks (Rumelhart et al. 1986). Consider a hidden unit τ connected to the output layer. We assume that its output value is defined by $v = a(w^T u)$, where u is the vector of output values of the units connected to the unit τ, w equals the weights attached to these connections, and a(·) is an activation function. Then the first- and second-order derivatives of v with respect to λ are calculated as

$$v' = a'(w^T u)(\Delta w^T u + w^T u'),$$
$$v'' = a''(w^T u)(\Delta w^T u + w^T u')^2 + a'(w^T u)(2\Delta w^T u' + w^T u'').$$

Thus, by successively applying these formulas in reverse order, we can calculate the step length for multilayered networks. Note that if $u_i$ is the output value of an input unit, then $u_i' = u_i'' = 0$.

2.6.2 Coping with Undesirable Cases. In the above, we assumed ζ′(0) < 0. When ζ′(0) > 0, the value of the objective function cannot be reduced along the search direction; thus, we set $\Delta\Phi_k$ to $-\nabla f(\Phi_k)$ and restart the update by discarding the accumulated vectors (p, r). Note that ζ′(0) < 0 is guaranteed by such a setting unless $\|\nabla f(\Phi_k)\| = 0$, because $\zeta'(0) = (\nabla f(\Phi_k))^T \Delta\Phi_k = -\|\nabla f(\Phi_k)\|^2 < 0$. When ζ′(0) < 0 and ζ″(0) ≤ 0, equation 2.8 gives a negative value or infinity. To avoid this situation, we employ the Gauss-Newton technique. The first-order approximation of $z_t = z(x_t; \Phi + \lambda\Delta\Phi)$ is $z_t + z_t'\lambda$. Then, ζ(λ) of the next iteration can be approximated by

$$\zeta(\lambda) \approx \frac{1}{2}\sum_{t=1}^{m}\left(y_t - (z_t + z_t'\lambda)\right)^2 = \zeta(0) + \zeta'(0)\lambda + \frac{1}{2}\sum_{t=1}^{m}(z_t')^2\,\lambda^2.$$
The minimal point of this approximation is given by

$$\lambda = -\frac{\zeta'(0)}{\sum_{t=1}^{m}(z_t')^2}. \qquad (2.9)$$
Clearly, equation 2.9 always gives a positive value when ζ′(0) < 0. In many cases, it is useful in a practical sense to limit the maximum change in Φ during any one iteration (Gill et al. 1981). Here, if $\|\lambda\Delta\Phi\| > 1.0$, λ is set to $\|\Delta\Phi\|^{-1}$. Since λ is calculated on the basis of an approximation, we cannot always reduce the value of the objective function ζ(λ). When ζ(λ) ≥ ζ(0), we employ the fast line search given by equation 2.6.

2.6.3 Summary of Step-Length Calculation. By integrating the above procedures, we can specify Step 4 as follows.
Step 4.1: If ζ′(0) > 0, set $\Delta\Phi_k = -\nabla f(\Phi_k)$ and set k = 1.
Step 4.2: If ζ″(0) > 0, calculate λ using equation 2.8; otherwise, calculate λ using equation 2.9.
Step 4.3: If $\|\lambda\Delta\Phi_k\| > 1.0$, set $\lambda = \|\Delta\Phi_k\|^{-1}$.
Step 4.4: If ζ(λ) > ζ(0), recalculate λ using equation 2.6 until ζ(λ) < ζ(0).

Hereafter, the quasi-Newton algorithm based on our partial BFGS update in combination with our step-length calculation is called BPQ (BP based on partial Quasi-Newton). Incidentally, a modified OSS algorithm, OSS2 (a combination of the memoryless BFGS update and our step-length calculation), may be another good algorithm and will be evaluated in the experiments.

2.7 Computational Complexity. We consider the computational complexity of BPQ and other algorithms with respect to one iteration in which every training example is presented once. In off-line BP, the complexity (number of multiplications) to calculate the objective function is nhm + O(hm), and the complexity for the gradient vector is nhm + O(hm). Thus, since N = nh + 2h + 1, the complexity for off-line BP is 2Nm + O(hm). Here, the complexity for a weight update is just N (or 2N if a momentum term is used) and is safely negligible. In this article, we define one iteration of on-line BP as m updates sweeping all examples once. In addition to the above complexity, since on-line BP performs a weight update for each single example, the learning rate is multiplied into each element of the gradient vectors Nm times in one iteration; thus, the complexity for on-line BP is 3Nm + O(hm). In the case of on-line BP with a momentum term, the momentum factor is also multiplied into each element of the previous modification vectors Nm times in one iteration; thus, the complexity for on-line momentum BP is 4Nm + O(hm). In addition to the complexity for off-line BP, BPQ calculates the descent direction based on the partial BFGS update with a history of at most s iterations and also calculates the step length. The complexity of the former calculation is O(Ns) (see Section 2.4) and that of the latter is Nm + O(hm) (see Section 2.6.1). Here, note that the computational complexity to calculate the objective function can be reduced from Nm + O(hm) to O(hm) for three-layer networks. This is because in the next iteration, the output value of each hidden unit is given by $\sigma(w_i^T x_t + \lambda(\Delta w_i)^T x_t)$, but $(\Delta w_i)^T x_t$ is already calculated when the step length is calculated. Thus, the total complexity for BPQ is 2Nm + O(Ns) + O(hm). To reduce the generalization error for an unseen example, m should be larger than N. Since s is smaller than N, the complexity O(Ns) usually becomes much smaller than 2Nm, and the complexity for BPQ remains almost equivalent to that of off-line BP.
132
Kazumi Saito and Ryohei Nakano
the denominator is calculated by using an inner product. The result is mathematically the same as our step-length calculation, but the computational complexity of this method is much larger than that of our method, at least in the case of three-layer networks as shown below. By using Pearlmutter’s ∂ f (8+λ1Φ)|λ=0 (Pearlmutter operator, which is defined by <1Φ { f (Φ)} = ∂λ ∂ 1994), we can see that <1Φ { ∂wij f (Φ)} is an element of ∇ 2 f (Φ)1Φ because <1Φ {∇ f (Φ)} = ∇ 2 f (Φ)1Φ. Now, considering the case of a weight between the output unit and the hidden unit i, which is expressed as w0i , we obtain ( ) ¾ ½ m X ∂ f (Φ) = <1Φ − (yt − zt )σit <18 ∂w0i t=1 =
m X (<1Φ {zt }σit − (yt − zt )<18 {σit }), t=1
P where <1Φ {zt } = 1w00 + hi=1 (1w0i σit + w0i <1Φ {σit }) and <1Φ {σit } = σit (1 − σit )(1wi )T xt . Clearly, just like our method, (1wi )T xt must be calculated for each pair of hidden unit i and input xt ; thus, a minimum of nhm multiplications is required. Additionally, a complexity of at least O(m) is required for calculating <18 { ∂w∂ 0i f (Φ)}. Therefore, since a similar argument can be applied to other weights and the total number of the weights is N, the complexity of the existing method becomes nhm + O(Nm). On the other hand, the complexity of our method is nhm + O(hm), which is more efficient because N = nh + 2h + 1. Although the SCG (scaled conjugate gradient) algorithm (Møller 1993b) estimates the step length by using the model trust region approach, this method can be regarded as an efficient one-step line search method based on a one-sided difference equation. The step length calculated from equation 2.8 is approximated by ∇ 2 f (Φ)1Φ ≈
∇ f (Φ + δ1Φ) − ∇ f (Φ) , δ
where δ is defined as δ = δ0 k1Φk−1 and δ0 is a small constant. Clearly, the complexity of SCG is approximately double that of BP because the gradient vector ∇ f (Φ + δ1Φ) must be calculated during one-iteration. If the difference parameter δ is approximately equal to the optimal step-length λ, SCG will provide a more faithful picture of the error surface than our method; however, since our purpose is to estimate the optimal step-length λ, we generally do not know such δ. 3 Experiments This section evaluates BPQ’s performance in comparison with many other learning algorithms through experiments that use three types of problems.
Figure 1: Two-weight problem.
3.1 Two-Weight Problem. In order to evaluate learning efficiency graphically, we designed a simple learning problem in which only two weights are adjustable. Figure 1 describes the problem. In a three-layer network, each layer consists of only one unit, and the weight between the hidden unit and the output unit is fixed at 1. The weight between the input and hidden units is denoted w₁, the weight corresponding to the bias term at the output unit is denoted w₂, and the activation function of the hidden unit is σ(x) = 1/(1 + exp(−x)). Each target value y_t shown in Figure 1 was calculated from the corresponding input value x_t by setting (w₁, w₂) = (1, 0); thus, the global minimum point is given by these weight values. In this experiment, BPQ was compared with eight other learning algorithms: on-line BP with momentum term, off-line BP, off-line BP with momentum term, BP with learning rate adaptation (Silva and Almeida 1990), SCG (Møller 1993b), BFGS1, BFGS2, and BFGS3. Note that the memoryless BFGS update is equivalent to the original BFGS update when N = 2. The parameters of each algorithm were determined by trial and error or set to the values recommended by their inventors (Silva and Almeida 1990; Møller 1993b). For on-line momentum BP, the learning rate and the momentum factor were set to η = 0.05 and α = 0.9. For off-line BP, the learning rate η was set to 0.1. For off-line momentum BP, the learning rate and the momentum factor were set to η = 0.025 and α = 0.9. For adaptive BP, the increase and decrease factors were set to u = 1.1 and d = u⁻¹, and the initial update value η₀ was set to 0.01. For SCG, the constant δ₀ was set to 10⁻⁴, and the initial scaling parameter λ₁ was set to 10⁻⁶. Figures 2 and 3 show the learning trajectories on the error surface with respect to w₁ and w₂ during 100 iterations starting at (w₁, w₂) = (−0.5, 0), where MSE stands for the mean squared error:

MSE = (1/5) Σ_{t=1}^{5} (y_t − (σ(w₁ x_t) + w₂))².
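This surface is simple to reproduce. A minimal sketch follows; the five inputs x_t appear only in Figure 1, so the values used here are hypothetical placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # placeholder inputs
ys = sigmoid(1.0 * xs) + 0.0                  # targets from (w1, w2) = (1, 0)

def mse(w1, w2):
    """Mean squared error of the two-weight network over the examples."""
    return np.mean((ys - (sigmoid(w1 * xs) + w2)) ** 2)

print(mse(-0.5, 0.0))   # error at the trajectories' starting point
print(mse(1.0, 0.0))    # 0.0 at the global minimum
```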
(These experiments were done on a Power Macintosh 8100/100 computer.)
Figure 2: Learning trajectories of first-order algorithms.
Figure 2 shows that the search by these first-order algorithms proceeded very slowly along the curved valley floor. Within 100 iterations, none of the first-order algorithms was able to reach the minimum point. This reflects an inherent difficulty of first-order methods: since the directions of successive gradient vectors are almost opposite near the valley floor, even learning rate adaptation based on sign changes of the last two gradients cannot improve the learning efficiency. Note that the BP trajectories oscillate widely when a larger η is used. Figure 3, on the other hand, shows that SCG, BFGS1, BFGS2, BFGS3, and BPQ all found the minimum point within 100 iterations. In comparison with BPQ, SCG required a larger number of iterations before reaching the minimum point. BFGS1 required the largest number of iterations, showing that the accuracy of the step-length calculation is important for solving this problem. BFGS2 worked very well, almost identically to BPQ. The number of iterations for BFGS3 was slightly smaller than that for BPQ, but BFGS3 required much more computation time due to its exact line search. These
Figure 3: Learning trajectories of second-order algorithms.
experimental results indicate that second-order methods work well even when the error surface forms a curved valley floor.

3.2 Parity Problem. Since parity problems have been widely used as benchmarks, the eight-bit parity problem was used to compare BPQ with seven other algorithms: on-line momentum BP, adaptive BP, SCG, OSS2, BFGS1, BFGS2, and BFGS3. Since the scale of the problem was quite small, BPQ employed the original BFGS update; thus s = N. In the experiments, we used a three-layer network with eight hidden units (h = 8, N = 82).
Figure 4: Convergence for the 8-bit parity problem.
All possible input patterns were used as training examples (m = 256), and the target values were set to 1 or 0. The parameters of each algorithm were exactly the same as in the previous experiment, except that the learning rate η was set to 0.1 for on-line momentum BP. In all the algorithms, the initial values for the weights were randomly generated in the range [−1, 1]. In the experiments, the maximum CPU processing time was set to 100 seconds, and the iteration was terminated when ‖∇f(w)‖ < 10⁻⁸. (These experiments were done on an HP9000/735 computer.) Figure 4a shows the convergence property of BPQ and the first-order algorithms with respect to the average RMSE (root mean squared error, √(2f(w)/m)) and the standard deviation over 100 trials for each algorithm. Among these algorithms, BPQ converged the quickest, allowing us to stop the iteration at an early stage. Figure 4b compares the convergence property of BPQ with that of the second-order algorithms. In comparison with BPQ, the convergence of the other second-order algorithms was slow. This shows that the step-length calculation plays an important role in quasi-Newton methods.
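As a small illustrative sketch (our variable names, not the authors'), the complete training set just described can be generated as follows:

```python
import numpy as np
from itertools import product

# All 2**8 = 256 input patterns of the 8-bit parity problem,
# with target 1 for odd parity and 0 for even parity.
X = np.array(list(product([0, 1], repeat=8)), dtype=float)
y = X.sum(axis=1) % 2   # m = 256 examples
```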
3.3 Practical Problem. In order to evaluate how BPQ works on large-scale problems, we performed experiments on a speech synthesis problem, using a data set employed for parrot-like speaking (Nakano et al. 1995). This learning problem is a sort of nonlinear autoregression; the input vector consists of a feature vector (10 values) and an acoustic wave (8 values), and the target output value is the next value of the acoustic wave; the total number of input units is 18 (n = 18). In these experiments, the number of hidden units was set to 36 (h = 36); thus, the number of weights was 721 (N = 721). Using 12,800 training examples (m = 12,800), BPQ was compared with eight other algorithms: on-line BP, on-line momentum BP, adaptive BP, SCG, OSS2, BFGS1, BFGS2, and BFGS3. In all the algorithms, the initial values for the weights between the input and hidden units were randomly generated in the range [−1, 1]. The values of the weights between the hidden and output units were set to 0, except that the bias value at the output unit was set to the average output value over all training examples. In the experiments, the maximum number of iterations was set to 100, and the average RMSE was used to evaluate the convergence property. (These experiments were also done on an HP9000/735 computer.) Figure 5a compares BPQ with the first-order algorithms, using the convergence property with respect to the average RMSE and standard deviation over 10 trials for each algorithm. Note that the standard deviation for adaptive BP is not shown because it was always 0. For on-line BP, the learning rate η was set to 1.0, the best value in our experiments; even so, the learning curve oscillated widely. For on-line momentum BP, whose momentum factor α was fixed at 0.9, the learning rate η was set to 0.1, the best value in our experiments; although the learning curve did not oscillate widely, the convergence property was very similar to that of on-line BP. For adaptive BP, the learning rate η was set to 0.1 or 1.0; both values gave similar and rather poor results. For BPQ, the parameter s was set to 50; its convergence was the quickest, allowing us to stop the iteration at an early stage. Figure 5b compares BPQ with the other second-order algorithms, using the average values over 10 trials for each algorithm. In comparison to BPQ, BFGS1 worked rather poorly; BFGS3 required more processing time to reach the best RMSE level of BPQ, while BFGS2 worked better than BFGS3, coming rather close to BPQ. Figure 5c compares the average one-iteration processing time of all the algorithms. The one-iteration time of OSS2 or BPQ was almost equivalent to that of adaptive BP. On-line BP was about 1.3 times slower due to weight updates after each single example; on-line momentum BP was almost twice as slow due to the additional multiplication by the momentum factor. BFGS1 required about half again as much processing time, since it sometimes needed a few extra function evaluations in order to shorten the step lengths; BFGS2 was almost four times slower due to an additional line search calculation needed to meet its stopping criterion; BFGS3 was almost seven times slower due to its exact line search calculation. SCG was almost twice as slow due to an additional gradient calculation.
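The input-output layout described at the beginning of this subsection can be sketched as follows (a hedged illustration: the exact windowing used by Nakano et al. 1995 is not specified here, so this is only one plausible reading):

```python
import numpy as np

def make_examples(features, wave, lag=8):
    """Nonlinear autoregression layout: each input concatenates a 10-value
    feature vector with the `lag` most recent wave samples; the target is
    the next wave sample. `features` is (T, 10), `wave` is (T,)."""
    X, y = [], []
    for t in range(lag, len(wave)):
        X.append(np.concatenate([features[t], wave[t - lag:t]]))
        y.append(wave[t])
    return np.array(X), np.array(y)
```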
Figure 5: Convergence for a practical problem.
Figure 6 shows the influence of the partiality parameter s on the convergence property for each of the four line search algorithms. Each curve is drawn using the average RMSE over 10 trials. In general, this influence was small compared with the differences among the line search methods. BPQ with s = 5 worked as well as with s = 50, while BPQ with s = 2 worked poorly compared with BPQ with s ≥ 5. Consequently, it was shown that an efficient and accurate step-length calculation plays an important role in the convergence of quasi-Newton algorithms, while the partial BFGS update with a proper s greatly saves storage space without losing convergence performance.
Figure 6: Effect of partiality parameter s.
4 Conclusion

We have proposed a small-memory, efficient second-order learning algorithm called BPQ for three-layer neural networks. In BPQ, the search direction is calculated on the basis of a partial BFGS update, and a reasonably accurate step length is efficiently calculated as the minimal point of a second-order approximation. In our experiments with a two-weight problem, a parity problem, and a speech synthesis problem, BPQ worked better than the other major learning algorithms. Moreover, it turned out that an efficient and accurate step-length calculation plays an important role in the convergence of quasi-Newton algorithms, while the partial BFGS update greatly saves storage space without losing convergence performance. In the future, we plan to carry out further comparisons using a wider variety of problems, including large-scale classification problems.
Acknowledgments

We thank the anonymous referees for many helpful suggestions and comments.
References

Barnard, E. 1992. Optimization for training neural nets. IEEE Trans. Neural Networks 3(2), 232–240.
Battiti, R. 1989. Accelerating back-propagation learning: Two optimization methods. Complex Systems 3(4), 331–342.
Battiti, R. 1992. First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Comp. 4(2), 141–166.
Fahlman, S. 1988. Faster-learning variations on back-propagation: An empirical study. In Proceedings of the 1988 Connectionist Models Summer School, pp. 38–51. San Mateo, CA.
Fletcher, R. 1980. Practical Methods of Optimization. Vol. 1. John Wiley, New York.
Gill, P., Murray, W., and Wright, M. 1981. Practical Optimization. Academic Press, London.
Jacobs, R. 1988. Increased rates of convergence through learning rate adaptation. Neural Networks 1(4), 295–307.
Kuhn, G., and Herzberg, P. 1990. Some variations on training recurrent neural networks. In Proceedings of CAIP Neural Networks Workshop, pp. 15–17, Rutgers University, New Brunswick, NJ.
LeCun, Y., Simard, P., and Pearlmutter, B. 1993. Automatic learning rate maximization by on-line estimation of the Hessian's eigenvectors. In Neural Information Processing Systems 5, S. Hanson, J. Cowan, and C. Giles, eds., pp. 156–163. Morgan Kaufmann, San Mateo, CA.
Luenberger, D. 1984. Linear and Nonlinear Programming. Addison-Wesley, Reading, MA.
Møller, M. 1993a. Exact calculation of the product of the Hessian matrix of feedforward network error functions and a vector in O(N) time. Tech. Rep. DAIMI PB-432, Computer Science Department, Aarhus University.
Møller, M. 1993b. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 6(4), 525–533.
Møller, M. 1993c. Supervised learning on large redundant training sets. International J. Neural Systems 4(1), 15–25.
Nakano, R., Ueda, N., Saito, K., and Yamada, T. 1995. Parrot-like speaking using optimal vector quantization. In Proc. IEEE International Conference on Neural Networks, Perth, Australia.
Pearlmutter, B. 1994. Fast exact multiplication by the Hessian. Neural Comp. 6(1), 147–160.
Powell, M. 1976. Some global convergence properties of a variable metric algorithm for minimization without exact line searches. In Nonlinear Programming, SIAM-AMS Proceedings, Vol. 9, Providence, RI.
Riedmiller, M., and Braun, H. 1993. A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proc. IEEE International Conference on Neural Networks, San Francisco.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. Rumelhart and J. McClelland, eds., pp. 318–362. MIT Press, Cambridge, MA.
Silva, F., and Almeida, L. 1990. Speeding up backpropagation. In Advanced Neural Computers, R. Eckmiller, ed., pp. 151–160. North-Holland, Amsterdam.
Watrous, R. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. In Proc. IEEE International Conference on Neural Networks, pp. II-619–627, San Diego, CA.
Received May 22, 1995; accepted April 8, 1996.
Communicated by Frederico Girosi
Neural Networks for Functional Approximation and System Identification

H. N. Mhaskar
Department of Mathematics, California State University, Los Angeles, CA 90032 USA
Nahmwoo Hahm
Department of Mathematics, University of Texas at Austin, Austin, TX 78712 USA
We construct generalized translation networks to approximate uniformly a class of nonlinear, continuous functionals defined on L^p([−1, 1]^s) for integer s ≥ 1, 1 ≤ p < ∞, or on C([−1, 1]^s). We obtain lower bounds on the possible order of approximation for such functionals in terms of any approximation process depending continuously on a given number of parameters. Our networks almost achieve this order of approximation in terms of the number of parameters (neurons) involved in the network. The training is simple and noniterative; in particular, we avoid any optimization such as that involved in the usual backpropagation.

1 Introduction

One of the important applications of neural networks is to approximate functions defined on a compact subset of a Euclidean space in a highly parallel manner. (We give precise definitions in Section 2.) Following the well-known initial results of Cybenko (1989), Hornik et al. (1989), Park and Sandberg (1991), and Barron (1993), among others, there has been a great deal of activity in this area in recent years. Motivated by the work of Girosi et al. (1995), Mhaskar and Micchelli (1995) introduced the notion of generalized translation networks, unifying the themes of neural networks, radial basis function networks, and the generalized regularization networks of Girosi et al. (1995). They studied in detail how the choice of the activation function affects the degree of approximation of the target function. Subsequently, Mhaskar (1996) constructed generalized translation networks utilizing specific activation functions to obtain an optimal order of approximation when the only a priori information about the target function is that it belongs to some Sobolev class. In this article, we obtain analogous results when a generalized translation network is used to approximate functionals on compact subsets of an L^p space rather than functions of finitely many real variables. The problem of approximating a functional arises naturally in the study of identification and control of nonlinear systems. In an identification problem, the hidden
states of a nonlinear system are not known, and one wishes to develop a model merely by observing the input-output relationships of such a system (cf. Levin and Narendra 1995 for a recent discussion). The output at any given time is a functional of the input signal, which is a function of time. Similarly, a control problem can be thought of as a functional, which at any given time maps the input control signals to an output signal. The actual system itself is often unknown, and a simple model is desirable (cf. Sontag N.d. for a recent review). Narendra and Parthasarathy (1990) have proposed different models that can utilize the approximation capabilities of neural networks for system identification and control. Levin and Narendra (1995) have studied neural networks for the identification of systems using only observations on the input-output relationships of the system. In both of these papers, the output signal at any time is considered as a function of finitely many samples of the input and output signals, typically taken prior to the moment in question. A different approach has been suggested by Sandberg (1991b), where the system is thought of directly as a functional acting on the input signal, and an NL system is used as a model for this functional. Here, N is a static neural network, L has a simple representation in terms of bounded linear functionals, and the model is the composite map N ◦ L. Sandberg has shown that any real continuous functional on any compact subset of a real normed linear space can be uniformly approximated in this way and has given several related results concerning, for example, the choice of the linear functional L that can be used. He has also studied the case when radial (or elliptic) basis function networks are used instead of the conventional neural networks. Further research in this direction has also been done by Chen and Chen (1993, 1995a, 1995b), Dingankar (1995), and Modha and Hecht-Nielsen (1993). In this article, we investigate the question of obtaining bounds on the size of the networks involved in terms of the desired accuracy of approximation. As expected, the bound will depend on the properties of the compact subset of the function space on which the functional acts and also on the structural properties of the functional itself. Typically, both of these are likely to be unknown. Therefore, one is led to the notion of universal approximation— approximating the functional under minimal a priori assumptions on the functional and the input signals on which it acts. We prove that there are some inherent lower bounds for the accuracy of such universal approximation. We will also construct generalized translation networks that almost achieve this lower bound. The networks to be constructed are very general and include as special cases gaussian networks, thin-plate spline networks, neural networks with the hyperbolic tangent activation function, and a variety of other commonly used networks. Although our focus is to obtain a theoretical insight into the problem, our proof suggests an explicit formula for the networks. There is no training involved in the usual sense. In particular, we do not use any optimization technique, such as minimization of a
backpropagation error surface. Thus, the large number of neurons required for our networks is offset by the extremely simple "training," and we do not encounter any of the problems commonly associated with optimization, such as local minima. In the next section, we state and discuss our results. The proofs are given in Section 3.

2 Main Results

Following the approach in the papers of Sandberg (1991a, 1991b), we think of an arbitrary nonlinear system as a continuous functional F acting on a compact subset of some L^p space. Let s ≥ 1 be an integer, which will be fixed throughout the rest of this paper. If A ⊂ R^s is a (Lebesgue-) measurable set and f: A → R is a measurable function, we write

‖f‖_{p,A} := (∫_A |f(t)|^p dt)^{1/p} if 1 ≤ p < ∞, and ‖f‖_{∞,A} := ess sup_{t∈A} |f(t)|.  (2.1)
The space L^p(A) is defined to be the class of all functions f: A → R for which ‖f‖_{p,A} < ∞. As usual, we consider two functions to be equal if they are equal almost everywhere in the measure-theoretic sense. In this article, if A = [−1, 1]^s, then we will omit the mention of this set from the notation; thus, we write ‖f‖_p to denote ‖f‖_{p,[−1,1]^s} and so forth. Further, we will have no occasion to consider functions in L^∞(A) that are not continuous. Therefore, we will simplify our notation by using L^∞(A) to denote also the class C(A), consisting of bounded, uniformly continuous functions on A. We also recall a few facts about compact subsets of L^p. There are several known characterizations of compact sets in L^p(A) (Sobolev 1963); for illustrating our theorems, we find the following characterization useful, where we restrict our attention to the case A = [−1, 1]^s. For any real number λ > 0, we denote the class of all algebraic polynomials in s variables of coordinatewise degree not exceeding λ by Π_{λ,s}. If f ∈ L^p and λ ≥ 0, we write

E_{λ,p,s}(f) := inf_{P∈Π_{λ,s}} ‖f − P‖_p.  (2.2)
It is not too difficult to verify (cf. Lorentz 1953, p. 33) that a closed set K ⊂ L^p is compact if and only if there is a constant M_K > 0 and a sequence of positive numbers {ε_{m,K}}, converging to 0 as m → ∞, such that for every f ∈ K,

‖f‖_p ≤ M_K,   E_{m,p,s}(f) ≤ ε_{m,K},   m = 0, 1, . . . .  (2.3)

For example, for the set of functions satisfying a Hölder condition of order α, one may choose ε_{m,K} = c m^{−α} for a suitable constant c.
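This decay can be checked numerically. A minimal sketch (our own illustration, using a least-squares Chebyshev fit on a dense grid as a crude upper-bound proxy for the best approximation error with s = 1, p = ∞):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def poly_approx_error(f, m, grid=2001):
    """Uniform error of a degree-m least-squares polynomial fit on [-1, 1];
    an upper-bound proxy for the best approximation error E_m."""
    x = np.linspace(-1.0, 1.0, grid)
    coeffs = C.chebfit(x, f(x), m)
    return np.max(np.abs(f(x) - C.chebval(x, coeffs)))

# f(x) = |x|**0.5 satisfies a Hoelder condition of order 1/2;
# the errors below decay roughly like m**(-1/2).
for m in (4, 16, 64, 256):
    print(m, poly_approx_error(lambda x: np.abs(x) ** 0.5, m))
```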
Given a system defined by a functional F on a compact set K ⊂ L^p, the objective of system identification is to build another functional G on this compact set that will approximate F well and also serve as a model with known parameters. Although the problem arises because F is typically unknown, it is reasonable to make some a priori assumptions on a class of functionals to which F may belong, as is done commonly in the theory of approximation of functions. Moreover, if we wish to consider the system not as a functional but as an operator T producing an output signal Tf on receiving an input signal f, this operator can be thought of as a set of functionals: each t in the domain of the output signal defines a functional f → Tf(t). A model for this operator will achieve uniform approximation of all these functionals. Thus, in this paper, we are interested in approximating every functional belonging to a class 𝓕 satisfying some minimal conditions, which we now describe. We recall that a function Ω: (0, ∞) → (0, ∞) is said to be a modulus of continuity (Lorentz 1966) if each of the following properties holds: (a) Ω(h) → 0 as h → 0; (b) Ω is a positive and increasing function; and (c) Ω is subadditive, that is,

Ω(h₁ + h₂) ≤ Ω(h₁) + Ω(h₂),   h₁, h₂ > 0.  (2.4)
For example, for any α ∈ (0, 1), the function Ω(h) = h^α is a modulus of continuity. For any real functional F uniformly continuous on L^p, the quantity sup{|F(f) − F(g)|: ‖f − g‖_p ≤ δ}, considered as a function of δ, is a modulus of continuity (of F). More generally, if X is a topological space and T: L^p → C(X) is a bounded, uniformly continuous function, then the quantity sup{|Tf(x) − Tg(x)|: x ∈ X, ‖f − g‖_p ≤ δ}, considered as a function of δ, is a modulus of continuity (of T). For each x ∈ X, the modulus of continuity of T is a majorant for the modulus of continuity of the functional f → Tf(x). Motivated by this observation and the Jackson theorem in the theory of trigonometric approximation, the only assumption we wish to make about the functional is that there be a known majorant for its modulus of continuity. Accordingly, we will be interested here in the universal approximation of the class

𝓕 := 𝓕_{Ω,p,s} := {F: L^p → R : |F(f) − F(g)| ≤ Ω(‖f − g‖_p), f, g ∈ L^p},  (2.5)
where Ω is a given modulus of continuity. Our main objective in this article is to construct generalized translation networks to model every functional in 𝓕. We now proceed to describe these networks. Let 1 ≤ d ≤ D, N ≥ 1 be integers, and φ: R^d → R. A generalized translation network with N neurons evaluates a function of the form

Σ_{k=1}^{N} a_k φ(A_k(·) + b_k),

where the weights A_k are d × D real matrices, the thresholds b_k ∈ R^d, and the coefficients a_k ∈ R (1 ≤ k ≤ N). The set of all such functions (with a fixed N) will be denoted by Π_{φ;N,D}. When d = 1, Π_{φ;N,D} is the set of all outputs of a neural network with N neurons, each evaluating the activation function φ and receiving D inputs. When d = D and φ is a radially symmetric function, Π_{φ;N,D} is the set of all outputs of a radial basis function network. Girosi et al. (1995) have pointed out the importance of studying the more general case considered here; they have demonstrated how such general networks arise naturally, as solutions of certain extremal problems, in applications such as image processing and graphics. Our first theorem in this section is Theorem 2.1. For multiintegers k = (k₁, . . . , k_d), m = (m₁, . . . , m_d) ∈ Z^d, we write k ≥ m if k_j ≥ m_j, 1 ≤ j ≤ d. If t ∈ R, we write k ≥ t if k_j ≥ t, 1 ≤ j ≤ d. The set of all k ∈ Z^d with k ≥ 0 will be denoted by Z^d₊. Throughout the rest of this article, we adopt the following convention regarding constants: the letters c, c₁, c₂, . . . denote positive constants depending only on p, d, s, φ, K, and 𝓕 (equivalently Ω), but independent of any other variables not explicitly indicated. Their values may be different at different occurrences, even within the same formula.
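A sketch of how such a network evaluates its output (an illustration only; the function phi and the parameter lists are placeholders):

```python
import numpy as np

def gtn_output(x, A_list, b_list, a_list, phi):
    """Generalized translation network: sum_k a_k * phi(A_k @ x + b_k),
    where each A_k is d x D, each b_k lies in R^d, each a_k is real,
    and phi maps R^d to R."""
    return sum(a * phi(A @ x + b)
               for A, b, a in zip(A_list, b_list, a_list))

# d = 1 with a sigmoidal phi gives an ordinary neural network with D inputs;
# d = D with a radially symmetric phi gives a radial basis function network.
```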
Theorem 2.1. Let d ≥ 1 be an integer, and let φ: R^d → R be infinitely many times continuously differentiable in some open sphere in R^d. We further assume that there exists b in this sphere such that

D^k φ(b) ≠ 0,   k ∈ Z^d₊.  (2.6)

Let s ≥ 1, N ≥ 1, m ≥ (log d)/s be integers, 1 ≤ p ≤ ∞, K be a compact subset of L^p, Ω be a modulus of continuity, and 𝓕 be the set as in equation 2.5. We write D := (2m + 1)^s. There exist continuous linear functionals γ_k: C(K) → R, k = 1, . . . , N, and a continuous linear operator L_m: L^p → R^D, with the following property: for every F ∈ 𝓕, there exist d × D matrices A_k := A_k(F) with rank d, k = 1, . . . , N, such that

sup_{f∈K} |F(f) − Σ_{k=1}^{N} γ_k(F) φ(A_k(L_m(f)) + b)| ≤ c {Ω(ε_{m,K}) + Ω(m^β N^{−1/D})},  (2.7)

where β := 2s max(1/p, 1 − 1/p).

It may seem a bit awkward to define the matrices A_k and the operator L_m as we did in the above theorem, rather than just defining operators from L^p into R^d. The reason for this will be explained during our discussion of the lower bounds. It is proved in Mhaskar (1996) that condition 2.6 is satisfied for
a large class of activation functions, including each of the following, where for x ∈ R^d we write |x| := (Σ_{j=1}^{d} x_j²)^{1/2}:

• d = 1, φ(x) := (1 + e^{−x})^{−1} (the squashing function);
• d ≥ 1, φ(x) := (1 + |x|²)^α, α ∉ Z (generalized multiquadrics);
• d ≥ 1, q ∈ Z, q > d/2, φ(x) := |x|^{2q−d} log |x| for d even and φ(x) := |x|^{2q−d} for d odd (thin-plate splines);
• d ≥ 1, φ(x) := exp(−|x|²) (the gaussian function).
An important special case of the compact set K in Theorem 2.1 is the case when ε_{m,K} = O(m^{−α}) for some α > 0. In this case, the order of magnitude of the quantity on the right-hand side of estimate 2.7 is minimized if we choose m to be a solution of the equation

(2m + 1)^s log m = log N / (α + β);  (2.8)

that is (cf. Olver 1974, Ex. 5.7, p. 14),

m ∼ (log N / log log N)^{1/s}.  (2.9)

(Our notation here is different from that in Olver 1974. By A ∼ B, we mean c₁B ≤ A ≤ c₂B.) The right-hand side of estimate 2.7 is then estimated by

cΩ((log log N / log N)^{α/s}).

We observe that if N_P denotes the number of real parameters in the network of estimate 2.7, then this order of magnitude is also

cΩ((log log N_P / log N_P)^{α/s}).

This looks like a very weak rate of convergence indeed, but we will prove that, under the weak a priori assumptions we have made, this result cannot be improved in general, except possibly for the factor of log log N_P in the numerator above. Before describing these lower bounds, we take this opportunity to make a few remarks about the construction of the networks. We observe that the operator L_m is independent of the functionals F and the input signals f. During the proof, we will give explicit expressions for the quantities L_m, A_k,
and the coefficients γ_k. It will then be seen that the matrices A_k (and clearly the threshold b) are all uniformly bounded, independent of the accuracy desired. It will also be seen that the matrices A_k depend not so much on F itself as on the entire class 𝓕 and on sup{|F(P)|: P ∈ Π_{2m,s}, ‖P‖_p ≤ c(N, m, p, s)} for some constant c depending on the indicated parameters only. In particular, if we restrict the set 𝓕 to consist of functionals satisfying an a priori bound, then the matrices A_k will also be independent of F. In function approximation, a bound on the Sobolev norm already implies such a bound on the norms of the target functions; there are technical reasons for not imposing such a bound at the outset in this context. Nevertheless, there is no training involved in finding these and other parameters of the network, and we have avoided all the shortcomings of the usual optimization-based techniques, such as backpropagation. Necessarily, the coefficients γ_k will be unbounded as the accuracy tends to 0; this is inevitable even in the case of ordinary function approximation (Mhaskar forthcoming). This situation is akin to the case of the familiar polynomial approximation on the interval: if one writes the approximating polynomials in terms of the monomials, the coefficients are far from being well behaved, and a change of basis to some system of orthogonal polynomials cures this problem. It will be seen from our proof that this is exactly the situation here as well. Thus, to implement the networks in practice, one should first implement certain auxiliary networks. These are independent of the actual systems being implemented, and the extra effort will pay off in terms of more stable coefficients of the resulting networks in the actual approximation of the systems. In the case of radial basis function networks, our matrices are not all equal to the identity matrix; our networks are not pure translation networks. The ideas in Mhaskar (1995) can be used to construct gaussian networks achieving the same rate of approximation, where only pure translations are required. We will not pursue these ideas here; an examination of our proofs and those in Mhaskar (1995) will show clearly how to do this. No new ideas are involved in this construction. Now we turn our attention to the lower bounds. These will be obtained for any models, not just for generalized translation networks. A model can be described mathematically in terms of two functions: a function π: 𝓕 → R^N, which selects the parameters of the model, and a function M_N: R^N → C(K), which reconstructs the model using these parameters. For any system F ∈ 𝓕, M_N(π(F)) is the model simulating F. For example, in the case of a neural network model, N denotes the total number of its real parameters, including the coefficients, components of weights, and thresholds. The training of the network consists of defining the function π, which determines these parameters given a target function. The mapping M_N then defines the output of the network itself, given these parameters. In our formulation of Theorem 2.1,
the operator L_m, being independent of F and f, is part of the definition of M_N. The lower bounds in the discussion below will apply even if the weights A_k were to depend on F in a much more complicated manner. The split formulation helps us allow this dependence without introducing new definitions of "widths" (see below) and also underlines the possible interdependence of π and M_N. It is naturally desirable that the function π be continuous on 𝓕, so as to achieve a robust model. As pointed out by Helmicki et al. (1991), it is desirable to measure the error of this simulation in the worst-case scenario. Mathematically, this amounts to estimating the quantity

sup_{f∈K} |F(f) − M_N(π(F), f)|.

Since F itself is unknown as well, we are led to consider

sup_{F∈𝓕, f∈K} |F(f) − M_N(π(F), f)|.

The inherent error in modeling is then measured by the nonlinear N-width (DeVore et al. 1989), defined as

Δ_N(𝓕) := inf_{M_N,π} sup_{F∈𝓕, f∈K} |F(f) − M_N(π(F), f)|,  (2.10)
where the infimum is taken over all continuous functions π: 𝓕 → R^N and all operators M_N: R^N → C(K). The following theorem gives a lower bound on the nonlinear N-width of 𝓕 in the case of certain compact sets K ⊂ L². To describe this compact set, we write |k|_∞ = max_{1≤j≤s} |k_j|, k = (k₁, . . . , k_s) ∈ Z^s. For f ∈ L², let Σ a_k(f)P_k denote the Fourier-Legendre expansion of f (cf. formulas 3.2 and 3.5).

Theorem 2.2. Let s ≥ 1 be an integer, α > 0, and

K := { f ∈ L² : Σ_k |k|_∞^{2α} |a_k(f)|² ≤ 1 }.  (2.11)

Let Ω be a modulus of continuity. Then for integer N ≥ 1,

Δ_N(𝓕_{Ω,2,s}) ≥ cΩ((log N)^{−α/s}).  (2.12)
3 Proofs

In this section, we prove Theorems 2.1 and 2.2. The idea behind the proof of Theorem 2.1 is the following: from 𝓕 and K, we construct a family of functions on a ball in R^D. Each member of this family will be approximated
by a polynomial of a certain degree n, where the rate of approximation is given by Newman and Shapiro (1964). Neural networks to approximate such polynomials are developed in Mhaskar (1996). We start this program by recalling the definition and a few facts about Legendre polynomials. A standard way to define the univariate, orthonormalized Legendre polynomial is by Rodrigues's formula (Szegő 1975):

P_n(x) := ((−1)^n √(n + 1/2) / (2^n n!)) (d/dx)^n {(1 − x²)^n},   x ∈ R, n = 0, 1, . . . .  (3.1)

If ℓ ≥ 2, k = (k₁, . . . , k_ℓ) ∈ Z^ℓ₊, and x = (x₁, . . . , x_ℓ) ∈ R^ℓ, then we write

P_k(x) := Π_{j=1}^{ℓ} P_{k_j}(x_j).  (3.2)
It is well known that

∫_{[−1,1]^ℓ} P_k(x) P_m(x) dx = 1 if k = m, and 0 otherwise.  (3.3)

If g ∈ L¹([−1, 1]^ℓ), its Fourier-Legendre coefficients are defined by

a_k(g) := ∫_{[−1,1]^ℓ} g(t) P_k(t) dt,  (3.4)

and the expression

Σ_{k∈Z^ℓ, k≥0} a_k(g) P_k,  (3.5)
convergent or not, is called its Fourier-Legendre expansion. Even if the expansion does not converge, the multisequence {a_k(g)} uniquely determines g. We develop some further notation. If D is as in Theorem 2.1 and a ∈ R^D, then we may index the coordinates of a by elements of {0, 1, . . . , 2m}^s instead of the usual indexing by elements of {1, 2, . . . , (2m + 1)^s}. The operator Π: R^D → Π_{2m,s} is then defined by

Π(a) := Σ_{0≤k≤2m} a_k P_k.  (3.6)
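A small numerical sketch of these objects (our own helper names; tensor Gauss-Legendre quadrature is one convenient way to evaluate equation 3.4 for small ℓ):

```python
import numpy as np
from numpy.polynomial import legendre as L

def orthonormal_legendre(n, x):
    """Orthonormalized univariate Legendre polynomial (equation 3.1)."""
    c = np.zeros(n + 1)
    c[n] = 1.0
    return np.sqrt(n + 0.5) * L.legval(x, c)

def fourier_legendre_coeff(g, k, quad_pts=40):
    """a_k(g) on [-1, 1]^len(k) via tensor Gauss-Legendre quadrature."""
    nodes, weights = L.leggauss(quad_pts)
    grids = np.meshgrid(*([nodes] * len(k)), indexing="ij")
    w = np.ones_like(grids[0])
    for W in np.meshgrid(*([weights] * len(k)), indexing="ij"):
        w = w * W
    Pk = np.ones_like(grids[0])
    for kj, xj in zip(k, grids):
        Pk = Pk * orthonormal_legendre(kj, xj)
    return float(np.sum(w * g(*grids) * Pk))
```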
The quantity |a| will denote the Euclidean norm of a. The following lemma establishes an important connection between |a| and ‖Π(a)‖_p.
Lemma 3.1. Let m ≥ 0 be an integer, D = (2m + 1)^s. Then, with

A_{m,p,s} := m^{2s max(1/p−1/2, 0)},   B_{m,p,s} := m^{2s max(1/2−1/p, 0)},  (3.7)

we have

|a| ≤ c A_{m,p,s} ‖Π(a)‖_p ≤ c₁ A_{m,p,s} B_{m,p,s} |a| = c₁ m^{β−s} |a|,  (3.8)

where β := 2s max(1/p, 1 − 1/p).

Proof. Let 0 < q, r ≤ ∞, let ℓ ≥ 1 be an integer, and let P ∈ Π_{ℓm,1} be a univariate polynomial. Then it is known (Nevai 1979, p. 114) that

‖P‖_{q,[−1,1]} ≤ c m^{2(1/r−1/q)} ‖P‖_{r,[−1,1]} if r < q, and ‖P‖_{q,[−1,1]} ≤ c ‖P‖_{r,[−1,1]} if r ≥ q.  (3.9)

The case q = ∞ is not included on p. 114 of Nevai (1979); however, it forms the basis for the proof there, and it is proved separately in the form of Theorem 13 (p. 113) and Lemma 5 (p. 108) of Nevai (1979). The second inequality above is, of course, just a consequence of Hölder's inequality. If P ∈ Π_{ℓm,s}, we apply the above estimates one by one to each of the coordinates of its argument to obtain

‖P‖_q ≤ c m^{2s(1/r−1/q)} ‖P‖_r if r < q, and ‖P‖_q ≤ c ‖P‖_r if r ≥ q.  (3.10)

The estimate 3.8 is now easy to deduce using the Parseval identity.

The following lemma gives the first step of our proof: the construction of a family of functions on a ball of R^D. We assume the notation and conditions of Theorem 2.1.

Lemma 3.2.
There exists a continuous linear operator L_m: L^p → R^D such that

|F(f) − F(Π(L_m(f)))| ≤ cΩ(ε_{m,K}),   F ∈ 𝓕, f ∈ K.  (3.11)

Moreover,

|L_m(f)| ≤ C A_{m,p,s} ‖f‖_p,   f ∈ L^p,  (3.12)

where C > 0 is a constant depending only on p and s, but not on m.

Proof. We take any continuous linear operator V_m: L^p → Π_{2m,s} such that

‖f − V_m(f)‖_p ≤ c E_{m,p,s}(f)  (3.13)
for all f ∈ L^p, where c is a positive constant depending only on p and s, and not on f or m. There are many such operators known in the literature (cf. Timan 1963; Lorentz 1966; Mhaskar 1996), and the actual choice is immaterial for this proof. We write

L_m(f) := (a_k(V_m(f)))_{0≤k≤2m, k∈Z^s}.  (3.14)

Then L_m is a continuous linear operator on L^p, and

‖f − Π(L_m(f))‖_p = ‖f − V_m(f)‖_p ≤ c ε_{m,K},   f ∈ K.  (3.15)

The estimate 3.11 is now clear. From Lemma 3.1, we obtain for f ∈ L^p that

|L_m(f)| ≤ c A_{m,p,s} ‖V_m(f)‖_p ≤ c A_{m,p,s} ‖f‖_p.  (3.16)

This proves 3.12.

Until the end of the proof of Theorem 2.1, we retain the value of the constant C in estimate 3.12. In order to describe the next step of our proof, we define, for any F ∈ 𝓕, a function µ_m(F) of D variables by the formula

µ_m(F, a) := F(Π(a)),   a ∈ R^D.  (3.17)
The next lemma gives a bound on the degree of polynomial approximation of µ_m(F), thus completing the second step of our proof.

Lemma 3.3. With the notation as in Theorem 2.1, for any F ∈ 𝓕 and integer n ≥ 1, there exists a polynomial Q(F) ∈ Π_{n,D} such that

|µ_m(F, a) − Q(F, a)| ≤ cΩ(m^β/n),   |a| ≤ C A_{m,p,s} M_K.  (3.18)

Proof. Using Lemma 3.1 and the definitions of the functions µ_m and Π, we obtain for any a, d ∈ R^D,

|µ_m(F, a) − µ_m(F, d)| = |F(Π(a)) − F(Π(d))| ≤ Ω(‖Π(a) − Π(d)‖_p) ≤ cΩ(B_{m,p,s} |a − d|).  (3.19)

The assertions of Lemma 3.3 now follow from Theorem 3 in Newman and Shapiro (1964).

Finally, we prove another lemma, which will help us replace the polynomials Q(F) by generalized translation networks.
Lemma 3.4. Let φ satisfy the conditions of Theorem 2.1, let n ≥ 1 be an integer, and let k ∈ Z^D₊ with k ≤ n. Then for every ε > 0, there exists G_{k,n,ε} ∈ Π_{φ;(6n+1)^D,D} such that ‖P_k − G_{k,n,ε}‖_{∞,[−1,1]^D} ≤ ε. The matrices involved in G_{k,n,ε} may be chosen to be matrices of rank d. Further, the matrices in each G_{k,n,ε} may be chosen from a fixed set with cardinality not exceeding (6n + 1)^D. The threshold of each neuron in each G_{k,n,ε} is the b defined in condition 2.6.

Proof. The proof of this lemma is almost verbatim the same as that of Lemma 3.2 of Mhaskar (1996). We omit the details, pointing out only the necessary changes. For w ∈ R^D, we define the d × D matrix

A_w := (diag(w₁, . . . , w_d) | E_w),  (3.20)

where E_w is the d × (D − d) block whose first row is (w_{d+1}, . . . , w_D) and whose remaining entries are 0, and write

Φ(w; x) := φ(A_w x + b),   x ∈ R^D.

Then, as in Lemma 3.2 of Mhaskar (1996), we may express every monomial x^p (p ∈ Z^D₊) as a multiple of a derivative of Φ with respect to w. A divided difference then gives a network Φ_{p,h} ∈ Π_{φ;(p₁+1)···(p_D+1),D} that approximates x^p uniformly on [−1, 1]^D to any desired degree of accuracy. The network G_{k,n,ε} is obtained by expressing P_k in terms of monomials and then replacing each monomial by the approximating network. The matrices involved are of the form of equation 3.20. By slight perturbations, one may also ensure that the matrices are of rank d and yet that the set of matrices is fixed, independent of k.
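For concreteness, a minimal sketch of the matrices A_w (our own helper, with 0-based indexing; the block layout follows equation 3.20):

```python
import numpy as np

def translation_matrix(w, d, D):
    """d x D matrix A_w: diag(w_1,...,w_d) in the left block, the
    remaining weights w_{d+1},...,w_D in the first row of the right
    block, zeros elsewhere."""
    A = np.zeros((d, D))
    A[:d, :d] = np.diag(w[:d])
    A[0, d:] = w[d:D]
    return A
```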
(3.21)
β
≤ cÄ(m /n). From estimate 3.11, we now obtain for all f ∈ K, F ∈ F , © ª |F( f ) − Q(F, Lm ( f ))| ≤ c Ä(²m,K ) + Ä(mβ /n) .
(3.22)
We write

Q(F) =: Σ_{k∈Z^D₊, k≤n} d_k(F) P_k.

Then

Σ_{k∈Z^D₊, k≤n} |d_k(F)| ≤ c(n, m, p, s) max_{|a|≤C A_{m,p,s} M_K} |Q(F, a)|
 ≤ c(n, m, p, s) max_{|a|≤C A_{m,p,s} M_K} |µ_m(F, a)|
 ≤ c(n, m, p, s) max_{P∈Π_{2m,s}, ‖P‖_p ≤ c₂(n,m,p,s)} |F(P)|.
We may then use the networks constructed in Lemma 3.4 to arrive (after an elementary rescaling) at a network approximating Q(F) to an arbitrary degree of accuracy on the ball in R^D of radius C A_{m,p,s} M_K. Moreover, these networks have the form required in Theorem 2.1. In view of estimate 3.22, these networks evaluated at L_m(f) give the required degree of approximation to F(f).

Next, we turn to the proof of Theorem 2.2. As expected, the proof of this theorem relies on the notion of Bernstein widths. If X is a normed linear space with norm ‖·‖, A ⊆ X, and N ≥ 1 is an integer, the Bernstein N-width of A (with respect to X) is defined as follows. Let X_{N+1} be the set of all linear subspaces of X having dimension not exceeding N + 1. The Bernstein N-width is given by the expression

b_N(A) := sup_{Y∈X_{N+1}} {ρ: ρf/‖f‖ ∈ A for all f ∈ Y}.  (3.23)
In the context of this article, we take X to be the space C(K) with the natural supremum norm. According to a result of DeVore et al. (1989),

Δ_N(𝓕) ≥ b_N(𝓕),   N = 1, 2, . . . .  (3.24)

Therefore, we will obtain a lower bound for b_N(𝓕). The idea is the same as in Theorem C of Newman and Shapiro (1964). We provide some details, since we are unable to find a precise reference in the context of Bernstein N-widths.

Proof of Theorem 2.2. In this proof, we write 𝓕 := 𝓕_{Ω,2,s}. Let m be an integer such that 2m + 1 ∼ (log N)^{1/s}, and let D := (2m + 1)^s. Following Newman and Shapiro (1964, Lemma 1), we obtain N + 1 points a₁, . . . , a_{N+1} in R^D such that

|a_i| ≤ 1/D^{α/s},   i = 1, . . . , N + 1,  (3.25)
and

|a_i − a_j| ≥ (2D^{α/s} N^{1/D})^{−1},   i ≠ j, i, j = 1, . . . , N + 1.  (3.26)

Then the functions

f_i := Π(a_i)  (3.27)

are in K, and

‖f_i − f_j‖₂ = |a_i − a_j| ≥ (2D^{α/s} N^{1/D})^{−1} =: 2δ,   i ≠ j, i, j = 1, . . . , N + 1.  (3.28)

Next, we define, for f ∈ L² and i = 1, . . . , N + 1,

F_i(f) := Ω(δ − ‖f − f_i‖₂) if ‖f − f_i‖₂ < δ, and F_i(f) := 0 otherwise.  (3.29)
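These "bump" functionals are straightforward to realize in code (a sketch under our own naming; functions are represented by their Fourier-Legendre coefficient vectors, so the L² norm becomes the Euclidean norm of the coefficients):

```python
import numpy as np

def bump_functional(f_i, delta, omega):
    """F_i of equation 3.29: omega(delta - ||f - f_i||) inside the ball of
    radius delta around f_i, and 0 outside; `f_i` is a coefficient vector,
    `omega` a modulus of continuity such as lambda h: h ** 0.5."""
    def F(a):
        dist = np.linalg.norm(np.asarray(a) - f_i)
        return omega(delta - dist) if dist < delta else 0.0
    return F
```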
Then each F_i is a bounded, continuous, real functional on L². If Y is the linear span of the F_i, then Y has dimension N + 1. Let F = Σ d_i F_i be an arbitrary element of Y. Then

‖F‖ := sup_{f∈K} |F(f)| = Ω(δ) max_{1≤i≤N+1} |d_i|.  (3.30)

Using the monotonicity and subadditivity of Ω, if ‖f − f_i‖₂, ‖g − f_i‖₂ ≤ δ, then

|F(f) − F(g)| = |d_i| |Ω(δ − ‖f − f_i‖₂) − Ω(δ − ‖g − f_i‖₂)| ≤ |d_i| Ω(|‖g − f_i‖₂ − ‖f − f_i‖₂|) ≤ (‖F‖/Ω(δ)) Ω(‖f − g‖₂).  (3.31)

If ‖f − f_i‖₂, ‖g − f_j‖₂ ≤ δ and i ≠ j, then we pick h, h̃ with ‖h − f_i‖₂ = δ, ‖h̃ − f_j‖₂ = δ, and ‖f − h‖₂, ‖g − h̃‖₂ ≤ ‖f − g‖₂. (For example, we may take the "line" joining f and g, and let h and h̃ be the "points" where this "line" intersects the balls around f_i and f_j, respectively.) Then F(h) = F(h̃) = 0 and, using the estimate just proved,

|F(f) − F(g)| ≤ |F(f) − F(h)| + |F(h̃) − F(g)| ≤ (‖F‖/Ω(δ)) Ω(‖f − h‖₂) + (‖F‖/Ω(δ)) Ω(‖h̃ − g‖₂) ≤ (2‖F‖/Ω(δ)) Ω(‖f − g‖₂).  (3.32)
The case when at most one of f and g is in a ball of radius δ around one of the f_i's is simpler. We have therefore proved that

|F(f) − F(g)| ≤ (2‖F‖/Ω(δ)) Ω(‖f − g‖₂)  (3.33)

for all f, g ∈ L². Consequently, any F ∈ Y with ‖F‖ ≤ Ω(δ)/2 is in 𝓕; that is,

b_N(𝓕) ≥ (1/2)Ω(δ).  (3.34)

From the definitions of D and δ (equation 3.28), we see that N^{1/D} ∼ 1 and δ ∼ (log N)^{−α/s}. Therefore, the proof of Theorem 2.2 is complete in view of the estimate 3.24.

4 Conclusions

Given a continuous, not necessarily linear, functional F on an L^p space, we have constructed generalized translation networks that provide a near-optimal approximation to F in terms of the number of neurons involved. Our networks evaluate a functional of the form Σ_{k=1}^{N} γ_k φ(A_k(L_m(f)) + b), where f is the input signal in the L^p space and φ: R^d → R is the activation function. This setup covers conventional neural networks (d = 1), as well as radial (elliptical) basis function networks, where d = D and φ is a radially symmetric function. In our construction, L_m is a linear operator mapping f to a judiciously chosen number of coefficients in a polynomial approximation of f. In particular, this operator is independent of both the input signals and the system to be approximated. The remaining parameters are obtained as in Mhaskar (1996) so as to approximate a function of finitely many real variables, constructed from F and L_m. The parameters A_k are full-rank matrices depending only on a norm of F, and they are the same for all F having the same norm bound. The parameter b depends only on φ. The parameters γ_k are continuous linear functionals of F. Explicit formulas are given for all of these parameters; one does not need to train the network in the usual sense. In particular, our constructions are explicit and deterministic and do not involve any iterative and/or optimization techniques, such as backpropagation. Our main objective is to obtain estimates on the number of neurons in terms of the desired accuracy of approximation and the properties of the functional F and the compact set on which the approximation takes place. Our estimates are also valid when a uniform approximation of a uniformly continuous operator on L^p, taking values in a class of bounded, continuous functions, is desired. Lower bounds for the degree of approximation are also obtained. Although this article is mainly of a theoretical nature, the proofs suggest actual constructions based on classical polynomial approximation theory.
We have not conducted any numerical experiments, but the approximation theory tools used here have been well tested over the last several decades.

Acknowledgments

The research of H. N. Mhaskar was supported, in part, by National Science Foundation Grant DMS 9404513 and Air Force Office of Scientific Research Grant F49620-93-1-0150. We are grateful to Professor Dr. I. Sandberg and one of the referees for clarifying the history of this problem, as well as to both referees for their valuable suggestions for improving the presentation of this article.

References

Barron, A. R. 1993. Universal approximation bounds for superposition of a sigmoidal function. IEEE Trans. Information Theory 39, 930–945.
Chen, T., and Chen, H. 1993. Approximation of continuous functionals by neural networks with application to dynamical systems. IEEE Trans. Neural Networks 4, 910–918.
Chen, T., and Chen, H. 1995a. Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans. Neural Networks 6, 911–917.
Chen, T., and Chen, H. 1995b. Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Trans. Neural Networks 6, 904–910.
Cybenko, G. 1989. Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems 2, 303–314.
DeVore, R., Howard, R., and Micchelli, C. A. 1989. Optimal nonlinear approximation. Manuscripta Mathematica 63, 469–478.
Dingankar, A. T. 1995. On applications of approximation theory to identification, control, and classification. Ph.D. dissertation, University of Texas at Austin.
Girosi, F., Jones, M., and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation 7, 219–269.
Goffman, C., and Pedrick, G. 1965. First Course in Functional Analysis. Prentice Hall, Englewood Cliffs, NJ.
Helmicki, A. J., Jacobsen, C. A., and Nett, C. N. 1991. Control oriented system identification: A worst case deterministic approach in H∞. IEEE Trans. Auto. Control 3, 1163–1176.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–366.
Levin, A. U., and Narendra, K. S. 1995. Identification using feedforward networks. Neural Computation 7, 349–357.
Lorentz, G. G. 1953. Bernstein Polynomials. University of Toronto Press, Toronto.
Lorentz, G. G. 1966. Approximation of Functions. Holt, Rinehart and Winston, New York.
Mhaskar, H. N. 1995. Versatile gaussian networks. In Proceedings of IEEE Workshop on Nonlinear Image and Signal Processing, I. Pitas, ed., pp. 70–73. Halkidiki, Greece, IEEE.
Mhaskar, H. N. 1996. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation 8, 164–177.
Mhaskar, H. N. Forthcoming. On smooth activation functions. Annals of Mathematics in Artificial Intelligence.
Mhaskar, H. N., and Micchelli, C. A. 1995. Degree of approximation by neural and translation networks with a single hidden layer. Advances in Applied Mathematics 16, 151–183.
Modha, D. S., and Hecht-Nielsen, R. 1993. Multilayer functionals. In Mathematical Approaches to Neural Networks, J. G. Taylor, ed., pp. 235–260. Elsevier Science, North Holland.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamic systems using neural networks. IEEE Trans. Neural Networks 1, 4–27.
Nevai, P. 1979. Orthogonal polynomials. Memoirs of Amer. Math. Soc. 18. American Mathematical Society, Providence, RI.
Newman, D. J., and Shapiro, H. S. 1964. Jackson's theorem in higher dimensions. In Approximation Theory, Proc. Conf. Math. Res. Inst., Oberwolfach, 1963, P. L. Butzer and J. Korevaar, eds., ISNM 5, pp. 208–219. Birkhäuser, Basel.
Olver, F. W. J. 1974. Asymptotics and Special Functions. Academic Press, New York.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial basis function networks. Neural Computation 3, 246–257.
Sandberg, I. W. 1991a. Approximation theorems for discrete time systems. IEEE Trans. Circuits and Systems 38, 564–566.
Sandberg, I. W. 1991b. Gaussian basis functions and approximations for nonlinear systems. In Digest of the Ninth Kobe International Symposium on Electronics and Information Sciences, pp. 3-1–3-6. Kobe, Japan.
Sobolev, S. L. 1963. Applications of Functional Analysis in Mathematical Physics. Trans. Math. Monographs 7. American Mathematical Society, Providence, RI.
Sontag, E. D. N.d. Some topics in neural networks and control. Rep. LS93-02, Siemens Corporate Research.
Szegő, G. 1975. Orthogonal Polynomials. Amer. Math. Soc. Coll. Publ. 23. American Mathematical Society, Providence, RI.
Timan, A. F. 1963. Theory of Approximation of Functions of a Real Variable. Macmillan Co., New York.
Received January 8, 1996; accepted April 1, 1996.
Communicated by David Cohn and Mark Plutowski
Selecting Optimal Experiments for Multiple Output Multilayer Perceptrons

Lisa M. Belue
Kenneth W. Bauer, Jr.
Department of Operational Sciences, Department of the Air Force, Air Force Institute of Technology, 2950 P Street, Wright-Patterson AFB, OH 45433-7765 USA
Dennis W. Ruck
Department of Electrical and Computer Engineering, Department of the Air Force, Air Force Institute of Technology, 2950 P Street, Wright-Patterson AFB, OH 45433-7765 USA
Where should a researcher conduct experiments to provide training data for a multilayer perceptron? This question is investigated, and a statistical method for selecting optimal experimental design points for multiple output multilayer perceptrons is introduced. Multiple class discrimination problems are examined using a framework in which the multilayer perceptron is viewed as a multivariate nonlinear regression model. Following a Bayesian formulation for the case where the variance-covariance matrix of the responses is unknown, a selection criterion is developed. This criterion is based on the volume of the joint confidence ellipsoid for the weights in a multilayer perceptron. An example is used to demonstrate the superiority of optimally selected design points over randomly chosen points, as well as points chosen in a grid pattern. Simplification of the basic criterion is offered through the use of Hadamard matrices to produce uncorrelated outputs.
1 Introduction

In this article a cohesive strategy is developed for selecting a set of design points in experimentation so as to develop accurate multilayer perceptron classifiers in a multiclass discrimination setting. Neural network researchers and practitioners often assume that training data are "provided by a source that we do not control" (MacKay 1992). The efficient selection of data is
necessary, however, if experiments to gather data are time-consuming or expensive. (See Rogers and Kabrisky 1991 for a general background and history of multilayer perceptrons.) In experimental design, one seeks to select points from some region of operability such that the design defined by these points is in some sense optimal. Most often, this selection process consists of developing some sensible criterion based on an assumed model and using this criterion to obtain a design. In 1943, Wald proposed a determinant criterion, the determinant being used as a measure of the "smallness" of a matrix. In 1959, Box and Lucas demonstrated how such a criterion could be applied to univariate nonlinear models (St. John and Draper 1975). Their criterion is constructed to minimize the volume of the joint confidence ellipsoid for the parameters in a model. This article explores the synthesis of the neural network and experimental design topic areas. The following example illustrates the setting: a researcher (chemist, engineer, biotechnician, etc., whichever is most familiar) believes that the performance of a system he or she is investigating would be best modeled by a multilayer perceptron with some set of physical characteristics of the system as inputs. A small amount of screening data is available. The researcher wishes to perform additional experiments to develop the most accurate multilayer perceptron classifier possible. Which experiments should be performed? Research has been conducted on selecting data that are expected to improve the generalization performance of neural networks (Cohn et al. 1990; Baum and Lang 1991; Hwang et al. 1991; Plutowski and White 1993). Results based on optimal experimental design have been described by Kinzel and Ruján (1990) and Watkin and Rau (1992) for simple perceptrons, and by Sollich (1994) for linear perceptrons. Cohn (1994) uses an inverse Hessian method to choose a trajectory through the input space, choosing a single point to be added to the training set at each step. The approach of maximizing the "information matrix" of a multilayer perceptron was described by MacKay (1992), using a sampling-theoretic perspective similar to that found in Fedorov (1972). MacKay derives a criterion that measures how useful a data point is expected to be; his strategy centers on maximizing the expected information gain of a new data point. In this article, we develop and illustrate the use of a practical procedure for selecting a new set of data points for multiple output multilayer perceptrons. Specifically, we adopt a Bayesian framework based on the works of Box and Lucas (1959), Draper and Hunter (1966), M. J. Box and Draper (1972), and others. Our procedure involves the calculation of a statistical criterion that, in the single output case, is equivalent to that proposed by MacKay (1992). The criterion can be challenging to compute in the multiple output case, and we offer a practical expedient to this difficulty by using desired outputs based on Hadamard matrices. Finally, we demonstrate the procedure on two example problems.
2 A Design Criterion for Multiple Output Multilayer Perceptrons 2.1 The Criterion for Multiple Response Nonlinear Models. Box and Hunter (1965) assert the need for carefully selecting design points: “If experiments are not carefully planned, the experimental points may be so situated in the space of the variables that the estimates which can be obtained for the parameters are not only imprecise but also highly correlated. Once the data are collected, a statistical analysis, no matter how elaborate, can do nothing to remedy this unfortunate situation. However, by the selection of suitable experimental design in advance, these shortcomings can often be overcome.” We follow a Bayesian formulation developed by Draper and Hunter (1966). First, we establish a design of experiments criteria for the case when the variance-covariance matrix of the responses is known. Next, we establish certain approximations necessary for practical implementation of the criterion. Finally, we show extensions to these arguments when the variancecovariance matrix of the responses is unknown. Let yis (i = 1, . . . , r; s = 1, . . . , N) be the ith response of the sth observation and xs = (xs1 , xs2 , . . . , xsn ), s = 1, . . . , N represent the levels of the independent variables. Then consider the model: yis = fi (xs ; θ ) + εis
(2.1)
where fi are r response functions, θ is a vector of p unknown parameters, and εis denote independent and identically distributed random errors with εis ∼ (0, σ 2 ) and ε = (ε1s , . . . , εrs )T ∼ (0, 6). Define the quantity vij =
N0 X [yis − fi (xs ; θ )][yjs − fj (xs ; θ )]
i, j = 1, . . . , r .
(2.2)
s=1
Suppose N0 observations of the form ys = (y1s , y2s , . . . , yrs )T
s = 1, 2, . . . , N0
(2.3)
have already been obtained, and let y = (y1 , y2 , . . . , yN0 )T . Assume y1 , y2 , . . . , yr follow a multivariate normal distribution. Then, given that the N0 observations are independent, the likelihood is − 12 N0 r
p(y | θ , σ ij ) = (2π)
|A|
1 2 N0
r r X 1X exp − σ ij vij 2 i=1 j=1
(2.4)
where A = 6 −1 = {σ ij } = {σij }−1 and the σ ij are assumed known (Box and Draper 1965).
164
Lisa M. Belue, Kenneth W. Bauer, Jr., and Dennis W. Ruck
Since little is known a priori about the values of the parameters, a locally uniform prior distribution (Box and Draper 1965) is used so that p(θ )dθ ∝ dθ .
(2.5)
A prior distribution such as this is intended to approximate information about the model parameters that is very vague when compared with the information supplied by the data (Seber and Wild 1989). Now, the posterior distribution for θ after N0 observations is proportional to pN0 (θ | σ ij , y)dθ .
(2.6)
According to Draper and Hunter (1966), this posterior distribution can be used as the prior distribution to determine the posterior distribution after N further observations are made. The new posterior distribution for θ after N0 + N observations is proportional to − 12 (N0 +N)r
pN0 +N (θ | σ ij , y)dθ ∝ (2π)
|A|
1 2 (N0 +N)
r r X 1X exp − σ ij vij dθ , 2 i=1 j=1 (2.7)
with the upper summation limits of $v_{ij}$ in equation 2.2 now extending to $N_0 + N$. The goal is to select the N observations in such a way that the posterior density (see equation 2.7) obtained after $N_0 + N$ observations is maximized with respect to $\theta$ and with respect to the N observations to be chosen. From a Bayesian point of view, all the information concerning the parameters $\theta$ is contained in the posterior distribution after the observations are obtained. Therefore, the goal is altered to achieve the best possible posterior distribution by proper choice of design before the observations have actually been obtained (Draper and Hunter 1966).

To apply the posterior distribution in equation 2.7, approximations are necessary. Assume that for a region of $\theta$-space sufficiently close to the maximum likelihood estimate $\hat{\theta}$, the following holds:

$$f_i(x_s; \theta) = f_i(x_s; \hat{\theta}) + \sum_{t=1}^{p} (\theta_t - \hat{\theta}_t)\, f_{is}^{(t)}, \quad i = 1, \ldots, r;\; s = 1, \ldots, N \tag{2.8}$$

where

$$f_{is}^{(t)} = \left[\frac{\partial f_i(x_s; \theta)}{\partial \theta_t}\right]_{\theta = \hat{\theta}}, \quad t = 1, \ldots, p. \tag{2.9}$$
Letting

$$\delta_{is} = y_{is} - f_i(x_s; \hat{\theta}), \tag{2.10}$$

one can write

$$\sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} v_{ij} = \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} \sum_{s=1}^{N_0+N} \delta_{is}\delta_{js} + \sum_{i=1}^{r}\sum_{j=1}^{r} (\theta - \hat{\theta})^T \left\{ \sigma^{ij} F_i^T F_j \right\} (\theta - \hat{\theta}) \tag{2.11}$$

with

$$F_i(\hat{\theta}) = \begin{bmatrix} f_{i1}^{(1)} & f_{i1}^{(2)} & \cdots & f_{i1}^{(p)} \\ f_{i2}^{(1)} & f_{i2}^{(2)} & \cdots & f_{i2}^{(p)} \\ \vdots & \vdots & & \vdots \\ f_{i(N_0+N)}^{(1)} & f_{i(N_0+N)}^{(2)} & \cdots & f_{i(N_0+N)}^{(p)} \end{bmatrix}. \tag{2.12}$$
There are no terms linear in $(\theta - \hat{\theta})$ due to the definition of $\hat{\theta}$ as the maximum likelihood estimate. Draper and Hunter (1966) state in their development that using equation 2.11 in equation 2.7 and performing the appropriate normalization yields

$$p_{N_0+N}(\theta \mid \sigma^{ij}, y) = (2\pi)^{-\frac{1}{2}p}\, |D|^{\frac{1}{2}}\, \exp\left\{-\frac{1}{2}(\theta - \hat{\theta})^T D\, (\theta - \hat{\theta})\right\}, \tag{2.13}$$

where

$$D = \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} F_i^T F_j. \tag{2.14}$$
By definition, maximization with respect to $\theta$ occurs when $\theta = \hat{\theta}$, evaluated after $N_0 + N$ observations. It thus remains to choose the design points so that the determinant

$$\left| \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma^{ij} F_i^T F_j \right| \tag{2.15}$$

is maximized with respect to $\theta$ and the N design points (Draper and Hunter 1966). For practical implementation, one must calculate the values of $F_i$
using parameter estimates obtained from the $N_0$ observations rather than all $N_0 + N$ observations.

M. J. Box and Draper (1972) rely on results from Box and Draper (1965) and Draper and Hunter (1966) to derive a design point criterion for multiresponse design of experiments when $\Sigma$ is unknown. They begin with the assumption that little is known about the values of the parameters and that their prior distribution will change little over a region in which the likelihood is substantial. This reasoning leads to using independent noninformative priors for $\theta$ and $\Sigma$. A noninformative prior can be thought of as a guess that one hopes is unbiased. Box and Draper derive the marginal posterior distribution of $\theta$ as

$$p(\theta \mid y) = C\, |v_{ij}|^{-\frac{1}{2}N} \tag{2.16}$$

where

$$C = \left[\int |v_{ij}|^{-\frac{1}{2}N}\, d\theta\right]^{-1} \tag{2.17}$$

is the normalizing constant and $v_{ij}$ denotes the entire matrix whose ijth element is $v_{ij}$. Here the $\sigma^{ij}$ have been integrated out by comparing the joint posterior distribution with the Wishart distribution (Box and Draper 1965). Parameter estimation in a nonlinear regression setting can be accomplished by maximizing $|v_{ij}|^{-\frac{1}{2}N}$. The criterion chooses "that value $\hat{\theta}$ of $\theta$ for which the posterior density is a maximum. This maximum posterior density is a function of the experimental settings adopted, so that if some or all of the experimental settings have not yet been selected, they can be chosen so as to maximize the maximum of the posterior density with respect to $\theta$" (M. J. Box and Draper 1972). Using the result from equation 2.11 in equation 2.16 and letting $F_i(\hat{\theta}) = F_i$,

$$p(\theta \mid y) = C\, \left|\left\{\sum_{s=1}^{N} \delta_{is}\delta_{js} + (\theta - \hat{\theta})^T F_i^T F_j\, (\theta - \hat{\theta})\right\}\right|^{-\frac{1}{2}N} \tag{2.18}$$

where the curly brackets denote the ijth element of the matrix. Letting $\hat{v}_{ij} = \sum_{s=1}^{N} \delta_{is}\delta_{js}$, $V = \{\hat{v}_{ij}\}$, and $V^{-1} = \{\hat{v}^{ij}\}$, the maximization of

$$\left| V + \left\{(\theta - \hat{\theta})^T F_i^T F_j\, (\theta - \hat{\theta})\right\} \right|^{-\frac{1}{2}N} \tag{2.19}$$

is required. Equivalently, one can maximize the approximation (M. J. Box and Draper 1972)

$$1 - \frac{1}{2}N(\theta - \hat{\theta})^T \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T F_j\, (\theta - \hat{\theta}). \tag{2.20}$$
It then remains to choose design points so that the determinant

$$\left| N \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T F_j \right| \tag{2.21}$$
is maximized. According to Seber and Wild (1989), $N V^{-1}$ is the maximum-likelihood estimator of $\Sigma^{-1}$. Further, it is appealing that the unknown variance-covariance elements are estimated by their natural estimators: 1/N times the sums of squares and cross products of the differences between observed and modeled responses (M. J. Box and Draper 1972). Since, in the design stage, the residuals of the unperformed experiments are unknown, all data vectors from completed experiments should be used to estimate the elements of the inverse variance-covariance matrix.

The multiresponse criterion can be viewed as a weighted sum of a single-response criterion taken over all possible response pairs. The criterion is defined as

$$D = \arg\max_{x_s \in R,\ s=1,\ldots,N} \left| \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T(X; \hat{\theta})\, F_j(X; \hat{\theta}) \right| \tag{2.22}$$

where $X = [x_1, x_2, \ldots, x_N]^T$ is a set of N data points. Note that when r = 1, the criterion in equation 2.22 is identical to the univariate criterion given by Box and Lucas (1959).

2.2 The Multilayer Perceptron Criterion. The multilayer perceptrons addressed in this paper consist of an input layer, a single hidden layer, and an output layer. The multilayer perceptrons are trained using instantaneous backpropagation with momentum. Sigmoids are applied at the hidden and output layers. We allow any number of output nodes and assume initially that the number of output nodes is equal to the number of classes in the current discrimination problem. Designs for single-output multilayer perceptrons used for two-class discrimination problems are addressed in a companion paper (Belue and Bauer 1995).

Typically, the parameters in a multilayer perceptron are called weights, and the set of all weights is denoted w. Since there is no natural ordering for these weights, an ordering is established here. The weights below the hidden layer are listed first, with all the weights emanating from an input node listed together. Then the weights above the hidden layer are listed, with the weights emanating from a hidden node listed together. So,

$$w = (w^1_{11}, w^1_{12}, \ldots, w^1_{1m}, w^1_{21}, w^1_{22}, \ldots, w^1_{2m}, \ldots, w^1_{nm}, \xi^1_1, \xi^1_2, \ldots, \xi^1_m, w^2_{11}, w^2_{12}, \ldots, w^2_{1r}, w^2_{21}, w^2_{22}, \ldots, w^2_{2r}, \ldots, w^2_{mr}, \xi^2_1, \xi^2_2, \ldots, \xi^2_r)^T \tag{2.23}$$
where n is the number of inputs (the dimension of the input vector), m is the number of hidden nodes, r is the number of outputs, and $\xi^l_j$ (j = 1, ..., r in the output layer and j = 1, ..., m in the hidden layer) are the weights connected to the bias elements for the lth layer. The dimension of this weight vector is given by

$$p = (n+1)m + (m+1)r. \tag{2.24}$$

The input vectors will be denoted $x_s = (x_{s1}, x_{s2}, \ldots, x_{sn})$, s = 1, ..., N. The nonlinear regression responses are called outputs in the multilayer perceptron arena and will be denoted $z^s_j$ for the jth output node's value given the sth input vector. The N × p matrix of first partials for output j in multilayer perceptron notation is

$$F_j(w) = \left\{\frac{\partial z^s_j}{\partial w_t}\right\}, \quad s = 1, \ldots, N;\; j = 1, \ldots, r;\; t = 1, \ldots, p \tag{2.25}$$

where

$$w_t = \begin{cases} w^1_{\lceil t/m \rceil,\; t - m(\lceil t/m \rceil - 1)} & \text{for } t \le (n+1)m \\ w^2_{\lceil (t-(n+1)m)/r \rceil,\; t - r(\lceil (t-(n+1)m)/r \rceil - 1) - (n+1)m} & \text{for } (n+1)m < t \le (n+1)m + (m+1)r \end{cases} \tag{2.26}$$

and $\lceil \alpha \rceil$ denotes the smallest integer greater than or equal to $\alpha$. Then, for lower-layer weights,

$$\frac{\partial z^s_j}{\partial w^1_{ki}} = z^s_j (1 - z^s_j)\, w^2_{ij}\, x^{1s}_i (1 - x^{1s}_i)\, x^s_k \tag{2.27}$$

where $x^{1s}_i$ is the activation of the ith hidden node given the input vector s. Similarly, for upper-layer weights,

$$\frac{\partial z^s_j}{\partial w^2_{ij}} = z^s_j (1 - z^s_j)\, x^{1s}_i. \tag{2.28}$$
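To make the bookkeeping concrete, the following sketch (our illustration, not code from the paper) evaluates equations 2.27 and 2.28 with NumPy and assembles the N × p matrix $F_j(w)$ of equation 2.25, flattening the weights in the order of equation 2.23; all function and variable names are our own.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def jacobian_Fj(X, W1, b1, W2, b2, j):
    """Assemble the N x p matrix F_j(w) of equation 2.25.

    X  : (N, n) input vectors
    W1 : (n, m) lower-layer weights, W1[k, i] = w^1_{ki}
    b1 : (m,)   hidden biases xi^1
    W2 : (m, r) upper-layer weights, W2[i, j] = w^2_{ij}
    b2 : (r,)   output biases xi^2
    j  : index of the output node
    """
    N, n = X.shape
    m, r = W2.shape
    X1 = sigmoid(X @ W1 + b1)                 # (N, m) hidden activations x^{1s}_i
    Z = sigmoid(X1 @ W2 + b2)                 # (N, r) outputs z^s_j
    g = Z[:, j] * (1.0 - Z[:, j])             # common factor z_j(1 - z_j)
    # Equation 2.27: dz_j/dw^1_{ki} = z_j(1-z_j) w^2_{ij} x^1_i(1-x^1_i) x_k
    hid = X1 * (1.0 - X1) * W2[:, j]          # (N, m)
    F_lower = g[:, None, None] * X[:, :, None] * hid[:, None, :]   # (N, n, m)
    F_b1 = g[:, None] * hid                   # bias input is the constant 1
    # Equation 2.28: dz_j/dw^2_{ij} = z_j(1-z_j) x^1_i (zero for other outputs)
    F_upper = np.zeros((N, m, r))
    F_upper[:, :, j] = g[:, None] * X1
    F_b2 = np.zeros((N, r))
    F_b2[:, j] = g
    # Flatten in the ordering of equation 2.23: lower weights, xi^1, upper, xi^2.
    return np.concatenate([F_lower.reshape(N, n * m), F_b1,
                           F_upper.reshape(N, m * r), F_b2], axis=1)

The returned matrix has $(n+1)m + (m+1)r$ columns, matching the weight-vector dimension p of equation 2.24.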
The elements of the estimated variance-covariance matrix for the outputs based on N exemplars are

$$\hat{\Sigma}_{ij} = \frac{1}{N}\hat{v}_{ij} = \frac{1}{N}\sum_{s=1}^{N} [d^s_i - z^s_i][d^s_j - z^s_j] \tag{2.29}$$
where $d^s_i$ is the desired output for the ith output node given the sth exemplar and $z^s_i$ is the actual output of the ith output node for the sth exemplar.

The design point criterion for a multiple-output multilayer perceptron follows directly from the theoretical development that begins with equation 2.1. This development assumes the presence of gaussian noise. Such an assumption is appropriate for regression problems but may not be appropriate for classification problems (Rumelhart 1995). In the example classification problems that follow, we have used gaussian noise as an approximation.

Note that in order to calculate $F_j(w)$, the weight vector w, or some estimate of it, must be known. An initial estimate $\hat{w}$ is obtained by training a multilayer perceptron on existing data vectors. Once $\hat{w}$ is determined, $F_j(\hat{w})$ can be formed. A second use of this initial data set is to determine the appropriate network architecture; Steppe et al. (1995) develop methods for feature input and model selection. The design point criterion in multilayer perceptron notation is

$$D = \arg\max_{x_s \in R,\ s=1,\ldots,N} \left| \sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T(X; \hat{w})\, F_j(X; \hat{w}) \right|. \tag{2.30}$$

Our proposed algorithm selects and queries several design points at a time, in contrast to selecting and querying a single design point. The obvious advantage of choosing several points is that multilayer perceptron training time is reduced; however, this advantage is tempered by the costs inherent in obtaining the extra data points.

3 Method of Maximization

Powell's maximization method was employed for finding the optimal design points (Powell 1973; Reklaitis et al. 1983; Press et al. 1989; Vanderplaats 1984). Powell's method is a zero-order algorithm, in that only evaluations of the original function are used to determine the maximum. According to Vanderplaats (1984), Powell's method, including its subsequent modifications, is one of the most efficient and reliable of the zero-order methods. The algorithm is based on the concept of conjugate directions. Given an n × n symmetric matrix H, the directions $S^{(1)}, S^{(2)}, \ldots, S^{(r)}$, r ≤ n, are said to be H-conjugate if the directions are linearly independent and

$$(S^{(i)})^T H S^{(j)} = 0, \quad i \ne j. \tag{3.1}$$

The significance of conjugacy is that a quadratic function will be maximized in n or fewer conjugate search directions. (See Belue 1995 and Reklaitis et al. 1983 for more details on the method.) As a practical matter, the maximization of $\left|\sum_{i=1}^{r}\sum_{j=1}^{r} \hat{v}^{ij} F_i^T F_j\right|$ requires some region of operability over which to search.
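A minimal sketch (ours, not the authors' code) of how such a search might be organized: the determinant of equation 2.30 becomes a scalar objective over the N candidate points, which is handed to a bounded Powell routine (here SciPy's implementation, an assumption of this sketch; the paper cites Powell 1973 and the optimization texts above). The [0, 1] bounds anticipate the feature normalization described next, and jacobian_Fj is the helper sketched after equation 2.28.

import numpy as np
from scipy.optimize import minimize

def design_criterion(xflat, net, Vinv, N, n):
    """Negative log-determinant of equation 2.30 for a candidate design X.

    net  : tuple (W1, b1, W2, b2) of trained network parameters
    Vinv : (r, r) array of estimated inverse variance-covariance weights
    """
    X = xflat.reshape(N, n)
    r = Vinv.shape[0]
    F = [jacobian_Fj(X, *net, j) for j in range(r)]          # r matrices, N x p
    D = sum(Vinv[i, j] * F[i].T @ F[j]
            for i in range(r) for j in range(r))             # p x p
    sign, logdet = np.linalg.slogdet(D)
    return 1e12 if sign <= 0 else -logdet                    # minimize the negative

def choose_design_points(net, Vinv, N, n, restarts=10, seed=None):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):                                # multiple runs; keep best
        x0 = rng.uniform(0.0, 1.0, size=N * n)               # features in [0, 1]
        res = minimize(design_criterion, x0, args=(net, Vinv, N, n),
                       method="Powell", bounds=[(0.0, 1.0)] * (N * n))
        if best is None or res.fun < best.fun:
            best = res
    return best.x.reshape(N, n)

Because Powell's method converges only to a local maximum, the restart loop mirrors the multiple-run strategy described below.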
We assumed that all data sets were normalized so that each feature was in the range [0, 1]. This is standard practice for multilayer perceptron inputs (Rogers and Kabrisky 1991). Powell's method is an unconstrained maximization algorithm; hence, it was modified so that after a locally good search direction is established, a constrained line maximization is performed to find the optimum point along the search direction. Powell's algorithm will converge in n search directions if the function is quadratic. Reklaitis et al. (1983) state that "it can be proven rigorously under reasonable hypothesis that the method will converge to a local maximum and will do this at a superlinear rate." Since convergence is to a local (not global) maximum, different starting points will result in different solutions. Therefore, multiple runs were performed, and the run producing the maximum value of the determinant was used. Finally, since Powell's method does not guarantee global optimality, it is possible that the generated design points will be suboptimal.

4 Example Discrimination Problem

As a first demonstration of the multiresponse criterion, a two-dimensional, four-class discrimination problem was used. We trained the multilayer perceptron 30 times using a subset of 100 of the training vectors. Each of the resulting weight vectors was tested on a remaining (disjoint) set of 500 training vectors; we chose the weight vector with the lowest classification error as the starting point for calculating the criterion. Classification error is defined as the percentage of vectors incorrectly classified in a data set. Figure 1 shows the discrimination problem and the multilayer perceptron boundary resulting from this initial weight vector.

Powell's method was then used to maximize the determinant criterion and choose 74 (letting N = p) design points. The chosen design points were classified according to the truth model and added to the training set. The multilayer perceptron was then retrained with the entire set of 174 exemplars (100 initial + 74 design points). Figure 2 shows the resulting design points and the boundary formed by the trained multilayer perceptron. For purposes of comparison, the multilayer perceptron was also trained with 74 randomly chosen points added to the training set. Table 1 shows the average test set classification error (after 1000 epochs) for 30 independently trained multilayer perceptrons for the initial training data and for the design point method training data, as compared to randomly chosen design points.

5 Reducing the Complexity of Design Point Determination

The design point criterion (see equation 2.30) is relatively simple in form. However, it requires a large number of calculations. To calculate the design point criterion for a single candidate set of design points requires $r^2$ evaluations of the form $\hat{v}^{ij} F_i^T F_j$.
Figure 1: “True” boundary and initial learned boundary.
Figure 2: Design points and resulting boundary.
Table 1: Summary of Test Set Classification Error.

Method         Training set                                       Classification error (%) (30 networks), mean ± std. error
Initial        100 initial training vectors                       26.85 ± .93
Random         100 initial training vectors + 74 random points    20.16 ± .49
Design points  100 initial training vectors + 74 chosen points    18.27 ± .49
Each matrix $F_j$ requires Np evaluations of the derivative $\partial z^s_j / \partial w_t$. In turn, these derivatives require evaluation of the multilayer perceptron at each of the N points. All of these calculations are required for a single evaluation of a single term in the design point criterion, and Powell's algorithm may require thousands of evaluations of the criterion. Needless to say, simplification of the criterion would be a desirable step toward practical implementation.

5.1 A Reduced Criterion. The multiresponse design point criterion (equation 2.30) could be simplified if the responses were uncorrelated, that is, if $N V^{-1}$ (the maximum likelihood estimator of $\Sigma^{-1}$) were a diagonal matrix. Then, rather than $r^2$ terms of the form $\hat{v}^{ij} F_i^T F_j$, there would be r terms of that form. One way to accomplish this goal is to "preset" the desired outputs in a manner that will produce uncorrelated actual outputs, resulting in a nearly diagonal estimated variance-covariance matrix. We attempt to produce such a situation by training the multilayer perceptron to desired outputs based on the rows of modified Hadamard matrices (Roberts 1989).

The Hadamard matrix is a square array whose elements are only +1 and −1 and whose rows and columns are orthogonal to one another. A symmetric Hadamard matrix whose first column contains only +1's is known as the normal form of the Hadamard matrix. The lowest-order Hadamard matrix is of order two:

$$H_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}. \tag{5.1}$$
Higher-order matrices, with orders restricted to powers of two, can be obtained from the recursive relationship

$$H_N = H_{N/2} \otimes H_2 \tag{5.2}$$
Table 2: Traditional Desired Outputs for Four-Class Discrimination.

          Class 1  Class 2  Class 3  Class 4
Output 1     1        0        0        0
Output 2     0        1        0        0
Output 3     0        0        1        0
Output 4     0        0        0        1
Table 3: Desired Outputs for Four-Class Discrimination.

          Class 1  Class 2  Class 3  Class 4
Output 1     1        1        1        1
Output 2     1        0        1        0
Output 3     1        1        0        0
Output 4     1        0        0        1
where N is a power of 2 and $\otimes$ denotes the Kronecker product, defined as

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1m}B \\ a_{21}B & a_{22}B & \cdots & a_{2m}B \\ \vdots & \vdots & & \vdots \\ a_{m1}B & a_{m2}B & \cdots & a_{mm}B \end{bmatrix} \tag{5.3}$$

(Beauchamp 1984).

For classification problems, multilayer perceptron desired outputs are typically binary words denoting class membership. A "traditional" coding scheme is shown in Table 2. Note that if we estimate the correlation between output 1 and output 2 based on these four vectors, there is a nonzero correlation. Similarly, the correlation is nonzero for the other outputs. Adjusting the Hadamard matrix (see Table 3) to include 0's rather than −1's, we find that the estimated correlation is 0 for all outputs. Notice that output 1 is the same (one) for each class and makes no contribution to classification. Therefore, this output will be removed, leaving three outputs for the classification of four classes. Assuming an equal number of training exemplars from each class, the idealized form of the variance-covariance matrix for the three desired outputs is
$$\tilde{\Sigma} = \begin{bmatrix} 0.3333 & 0 & 0 \\ 0 & 0.3333 & 0 \\ 0 & 0 & 0.3333 \end{bmatrix}. \tag{5.4}$$
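The coding in Table 3 is mechanical to generate. A small sketch (ours), using the recursion in equation 5.2: build $H_N$ by repeated Kronecker products, replace the −1 entries with 0, and drop the constant first output.

import numpy as np

def hadamard_targets(num_classes):
    """Modified Hadamard desired outputs, one column per class (cf. Table 3).

    Assumes num_classes is a power of two; replaces -1 entries with 0 and
    drops the constant first row, which contributes nothing to classification.
    """
    H = np.array([[1.0]])
    while H.shape[0] < num_classes:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))  # equation 5.2
    H[H < 0] = 0.0
    return H[1:, :]        # (num_classes - 1) outputs x num_classes classes

# For four classes this reproduces the last three rows of Table 3:
print(hadamard_targets(4))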
Table 4: Desired Outputs for Seven-Class Discrimination.

          Class 1  Class 2  Class 3  Class 4  Class 5  Class 6  Class 7
Output 1     1        1        1        1        1        1        1
Output 2     1        0        1        0        0.5      0.5      0.5
Output 3     1        1        0        0        0.5      0.5      0.5
Output 4     1        0        0        1        0.5      0.5      0.5
Output 5     1        1        1        1        0        0.8      0.8
Output 6     1        1        1        1        1        0        0.83333
Output 7     1        1        1        1        1        1        0
Belue (1995) offers an algorithm for forming uncorrelated outputs in the case where the number of classes is not a power of two. The desired outputs for seven-class discrimination (required in Section 5.3) are shown in Table 4.

Actual outputs will not equal desired outputs but should approach them as training progresses and network accuracy improves. Then, in calculating the estimated variance-covariance matrix for use with the design point criterion, one would expect to find nearly zero off-diagonal elements. If the form is approximately diagonal, the design point criterion can be reduced to

$$D = \arg\max_{x_s \in R,\ s=1,\ldots,N} \left| \sum_{i=1}^{r} \frac{N}{\tilde{\Sigma}_{ii}}\, F_i^T F_i \right|, \tag{5.5}$$

significantly reducing the number of calculations. Design points are then chosen based on this reduced criterion.

5.2 Example Problem Using Reduced Criterion. The four-class problem from Section 4 is used to demonstrate the simplified criterion. The desired outputs for only 50 input vectors were coded as in Table 3, and a multilayer perceptron with 10 hidden nodes was trained. The inverse variance-covariance matrix, $\hat{\Sigma}^{-1}$, estimated using the outputs from the trained network is
$$\hat{\Sigma}^{-1} = \begin{bmatrix} 12.4911 & 1.7141 & -6.2505 \\ 1.7141 & 34.2288 & -2.5606 \\ -6.2505 & -2.5606 & 12.9611 \end{bmatrix}. \tag{5.6}$$
This matrix is not approximately diagonal. However, at least four of the elements could be considered 0, reducing the matrix to
$$\hat{\Sigma}^{-1} = \begin{bmatrix} 12.4911 & 0.0000 & -6.2505 \\ 0.0000 & 34.2288 & 0.0000 \\ -6.2505 & 0.0000 & 12.9611 \end{bmatrix}. \tag{5.7}$$
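The reduction just performed amounts to zeroing the off-diagonal entries of $\hat{\Sigma}^{-1}$ that are small in magnitude and keeping only the surviving terms of the criterion. A small sketch (ours, with an assumed threshold of 3.0) reproduces equation 5.7:

import numpy as np

def reduce_terms(Sigma_inv, threshold):
    """Zero small entries of the inverse variance-covariance matrix and list
    the (i, j, weight) terms that survive in the design criterion."""
    A = np.where(np.abs(Sigma_inv) < threshold, 0.0, Sigma_inv)
    terms = [(i, j, A[i, j]) for i in range(A.shape[0])
             for j in range(A.shape[1]) if A[i, j] != 0.0]
    return A, terms

Sigma_inv = np.array([[12.4911, 1.7141, -6.2505],
                      [1.7141, 34.2288, -2.5606],
                      [-6.2505, -2.5606, 12.9611]])
A, terms = reduce_terms(Sigma_inv, threshold=3.0)   # yields equation 5.7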
The estimated variance-covariance matrix will be different for every multilayer perceptron trained, yet one would expect an approximately equivalent structure. The relatively large value of the (2, 2) element of $\hat{\Sigma}^{-1}$ stems from the class coding and the multilayer perceptron training. The second desired output is coded 1 for both class 1 and class 2 and 0 for class 3 and class 4. Since the multilayer perceptron can accurately discriminate between a superclass containing classes 1 and 2 and a superclass containing classes 3 and 4 (Fig. 1 shows an example of this phenomenon), the estimated variance of the second output is small. This small variance results in a large value in the inverse matrix. The design criterion reduces to approximately

$$|\hat{D}| = \left| 12.4911\, F_1^T F_1 - 6.2505\, F_1^T F_3 + 34.2288\, F_2^T F_2 - 6.2505\, F_3^T F_1 + 12.9611\, F_3^T F_3 \right|. \tag{5.8}$$
The original criterion with 16 terms has been reduced by 69%. The reduced criterion was used to determine 74 design points. Using the truth model, these 74 design points were classified and added to the training set. Figure 3 shows the design points and the resulting multilayer perceptron boundary. Table 5 compares the average test set classification error (after 1000 epochs) for the simplified criterion method, the "full" criterion method, and a randomly selected set of points.

5.3 Armor-Piercing Incendiary Projectile Application. In this section, we demonstrate the use of the design point criterion on a more realistic and challenging problem. An armor-piercing incendiary (API) projectile is a bullet that can perforate light armor. The projectile contains a flammable mixture encased in the nose of the projectile body. The design of the projectile allows the jacket over the nose to deform or peel off on impact with the target skin. Consequently, the incendiary mixture flashes as the steel core of the bullet continues its flight. The intense flash following jacket failure of an API is known as incendiary functioning (IF). There are five classifications for IF: complete, partial, slow burn, delayed, and nonfunctioning.

Prediction equations for API penetration mechanics are an essential part of aircraft vulnerability analysis. U.S. government agencies perform test shots in order to gather the data used in the development of prediction models. These test shots are time-consuming due to the necessary
Figure 3: Design points and resulting boundary for reduced criterion.
Table 5: Summary of Test Set Classification Error (30 Runs).

Method                Training set                                                             Classification error (%) (30 networks), mean ± std. error
Initial               50 initial training vectors                                              28.17 ± .68
Full criterion        50 initial training vectors + 74 chosen points (full criterion)          20.36 ± .49
Simplified criterion  50 initial training vectors + 74 chosen points (simplified criterion)    23.25 ± .52
Random                50 initial training vectors + 74 random points                           25.37 ± .55
preparations before each shot, and expensive due to the high costs inherent in the testing itself. This example highlights the real-world benefits of selecting a set of points prior to testing rather than setting up a sequence of test shots, each of which depends on the previous shots. Methods for predicting API functioning have been developed based on the physical characteristics of the system. Force (the pressure opposing the projectile’s penetration) and impulse (the time force is applied) have been
identified as the defining criteria for classifying IF types. Stress (the force per unit area) is given as $\sigma$ at normal obliquity. The term $\Delta\sigma$ is an additive correction given for other obliquity angles. The sum of the two stresses, $\sigma$ and $\Delta\sigma$, equals the total stress effect. A time parameter $t^*$ is used in place of impulse. With $\sigma + \Delta\sigma$ and $t^*$ calculated, stress-time plots are created. These plots delineate zones where the five IF classes occur. Figure 4 provides an example of one of these plots.

The time-stress plot in Figure 4 was rescaled and normalized to produce the plot shown in Figure 5. Each of the zones shown is numbered class 1 through class 7. This plot will be used as the truth model, and data will be generated from it. In this fashion, we can generate data with a definite real-world flavor but avoid testing costs. Further, the problem has the added appeal of having a low-dimensional feature space, so that graphical presentation of the analysis is feasible. The data were generated by simply drawing two independent random deviates from a uniform distribution bounded between 0.0 and 1.0. These pairs become coordinates on the scaled time-stress plot, and class membership was determined accordingly.

We started with 100 randomly drawn data points input to a multilayer perceptron with 10 middle nodes. To study the performance of the reduced criterion, the desired outputs for the 100 input vectors were coded as in Table 4, and a multilayer perceptron with 10 hidden nodes was trained. The inverse variance-covariance matrix, $\hat{\Sigma}^{-1}$, was estimated using outputs from the trained network. The matrix was not completely diagonal; however, at least 18 of the elements could be considered 0, reducing the matrix to

$$\hat{\Sigma}^{-1} = \begin{bmatrix} 6.0117 & 0.0000 & -3.8187 & 0.0000 & 0.0000 & 0.0000 \\ 0.0000 & 12.5071 & 0.0000 & 0.0000 & 0.0000 & 4.6405 \\ -3.8187 & 0.0000 & 7.6334 & -3.1881 & 0.0000 & -3.5436 \\ 0.0000 & 0.0000 & -3.1881 & 9.3183 & 0.0000 & 3.6686 \\ 0.0000 & 0.0000 & 0.0000 & 0.0000 & 10.7487 & 11.5373 \\ 0.0000 & 4.6405 & -3.5436 & 3.6686 & 11.5373 & 82.8441 \end{bmatrix}. \tag{5.9}$$

The resulting reduced design point criterion contains 63% fewer terms than the full criterion. The reduced criterion was used to determine 30 design points. Using the truth model, these 30 design points were classified and added to the training set. Table 6 displays the average test set classification error for the initial data, the initial data plus 30 points selected by the simplified criterion, and the initial data plus 30 randomly selected points. Here, the error when using the simplified criterion method is significantly less (in a statistical sense) than the error for the randomly selected points ($\alpha = 0.05$).

6 Summary

In this article, a Bayesian technique for selecting multilayer perceptron data is developed. The technique selects these training exemplars to best update
Figure 4: Stress-time plot.
the multilayer perceptron weights. At the core of the methodology is a statistical criterion based on minimizing the size of the joint confidence ellipsoid for the weights in the multilayer perceptron. The methodology is illustrated with a simple four-class example. A more substantive seven-class discrimination problem is then discussed, dealing with the classification of the performance of armor-piercing incendiary projectiles.

The work presented here extends previous research in several areas. First, a Bayesian multivariate criterion is developed for use with multiple-output neural networks. Second, the implementation of Powell's algorithm illustrates a practical method of optimizing the design point criterion. Next,
Figure 5: Truth model for seven-class stress-time plot discrimination problem.
Table 6: Stress-Time Plot Average Test Set Classification Error Comparisons (30 Runs).

Method                Training set                                                              Classification error (%) (30 networks), mean ± std. error
Initial               100 initial training vectors                                              27.38 ± .75
Simplified criterion  100 initial training vectors + 30 chosen points (simplified criterion)    19.50 ± .85
Random                100 initial training vectors + 30 random points                           22.43 ± 1.05
multiple design points are chosen in a single iteration rather than greedily selecting a single next design point. Having developed a multioutput criterion, we then show how this criterion can be simplified with an elementary modification to the values of the desired outputs. In summary, we show that by choosing data according to a Bayesian criterion, it is possible to achieve significantly better performance for a given amount of data.
Figure 6: Stress-time plot problem: Design points and resulting multilayer perceptron boundary.
Figure 7: Stress-time plot average test set classification error comparisons (30 runs).
References

Baum, E. B., and Lang, K. J. 1991. Constructing hidden units using examples and queries. In Advances in Neural Information Processing Systems 3, R. Lippmann, J. Moody, and D. Touretzky, eds., pp. 904–910. Morgan Kaufmann, San Mateo, CA.
Beauchamp, K. G. 1984. Applications of Walsh and Related Functions. Academic Press, New York.
Belue, L. M. 1995. Selecting Optimal Experiments for Feedforward Multilayer Perceptrons. Ph.D. dissertation, Air Force Institute of Technology, Wright-Patterson AFB, OH.
Belue, L. M., and Bauer, K. W., Jr. 1995. Selecting and ranking optimal experiments for single-output multilayer perceptrons. Submitted for publication to Journal of Statistical Computation and Simulation.
Box, G. E. P., and Draper, N. R. 1965. The Bayesian estimation of common parameters from several responses. Biometrika 52, 355–365.
Box, M. J., and Draper, N. R. 1972. Estimation and design criteria for multiresponse non-linear models with non-homogeneous variance. Applied Statistics 21, 13–24.
Box, G. E. P., and Hunter, W. G. 1965. Sequential design of experiments for nonlinear models. Proceedings of IBM Scientific Computing Symposium in Statistics, 113–137.
Box, G. E. P., and Lucas, H. L. 1959. Design of experiments in non-linear situations. Biometrika 46, 77–90.
Cohn, D. A. 1994. Neural network exploration using optimal experimental design. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector, eds., pp. 679–686. Morgan Kaufmann, San Mateo, CA.
Cohn, D. A., Atlas, L., and Ladner, R. 1990. Training connectionist networks with queries and selective sampling. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 566–573. Morgan Kaufmann, San Mateo, CA.
Draper, N. R., and Hunter, W. G. 1966. Design of experiments for parameter estimation in multi-response situations. Biometrika 53, 525–533.
Fedorov, V. V. 1972. Theory of Optimal Experiments. Academic Press, New York.
Hecht-Nielsen, R. 1989. Neurocomputing. Addison-Wesley, Reading, MA.
Hwang, J. N., Choi, J. J., Oh, S., and Marks, R. 1991. Query-based learning applied to partially trained multilayer perceptrons. IEEE Transactions on Neural Networks 2, 131–136.
Kinzel, W., and Ruján, P. 1990. Improving a network generalization ability by selecting examples. Europhysics Letters 13, 473–477.
MacKay, D. J. C. 1992. Information-based objective functions for active data selection. Neural Computation 4, 590–604.
Plutowski, M., and White, H. 1993. Selecting concise training sets from clean data. IEEE Transactions on Neural Networks 4, 305–318.
Powell, M. J. D. 1973. On search directions for minimization algorithms. Mathematical Programming 4, 193–201.
Press, W. H., et al. 1989. Numerical Recipes. Cambridge University Press, Cambridge.
Reklaitis, G. V., et al. 1983. Engineering Optimization: Methods and Applications. John Wiley, New York.
Roberts, J. E. 1989. Influence of target-vector code selection on the performance of a neural-network word recognizer. In Proceedings of the IEEE/INNS International Joint Conference on Neural Networks, II-628.
Rogers, S. K., and Kabrisky, M. 1991. An Introduction to Biological and Artificial Neural Networks for Pattern Recognition. SPIE Optical Engineering Press, Bellingham, WA.
Rumelhart, D. 1995. Backpropagation: Theory, Architectures, and Applications. Erlbaum, Hillsdale, NJ.
St. John, R. C., and Draper, N. R. 1975. D-optimality for regression designs: A review. Technometrics 17, 15–23.
Seber, G. A. F., and Wild, C. J. 1989. Nonlinear Regression. John Wiley, New York.
Sollich, P. 1994. Query construction, entropy and generalization in neural network models. Physical Review E 49, 4637–4651.
Steppe, J. M., Bauer, K. W., Jr., and Rogers, S. K. 1995. Integrated feature and architecture selection. Submitted to IEEE Transactions on Neural Networks.
Vanderplaats, G. N. 1984. Numerical Optimization Techniques for Engineering Design with Applications. McGraw-Hill, New York.
Watkin, T. L. H., and Rau, A. 1992. Selecting examples for perceptrons. J. Phys. A: Math. Gen. 25, 113–121.
Received April 7, 1995; accepted April 18, 1996.
Communicated by Jude Shavlik
A Penalty-Function Approach for Pruning Feedforward Neural Networks

Rudy Setiono
Department of Information Systems and Computer Science, National University of Singapore, Kent Ridge, Singapore 0511, Republic of Singapore
This article proposes the use of a penalty function for pruning feedforward neural networks by weight elimination. The proposed penalty function consists of two terms: the first discourages the use of unnecessary connections, and the second prevents the weights of the connections from taking excessively large values. Simple criteria for eliminating weights from the network are also given. The effectiveness of this penalty function is tested on three well-known problems: the contiguity problem, the parity problems, and the monks problems. The resulting pruned networks obtained for many of these problems have fewer connections than previously reported in the literature.
1 Introduction

We are concerned in this article with finding a minimal feedforward backpropagation neural network for solving the problem of distinguishing patterns from two or more sets in N-dimensional space. Backpropagation feedforward neural networks have been gaining acceptance as the method of choice for this pattern classification problem. One of the difficulties with using feedforward neural networks is the need to determine the optimal number of hidden units before the training process can begin. Too many hidden units may lead to overfitting of the data and poor generalization; too few hidden units may not give a network that learns the data.

Two different approaches have been proposed to overcome the problem of determining the optimal number of hidden units required by an artificial neural network to solve a given problem. The first approach begins with a minimal network and adds more hidden units only when they are needed to improve the learning capability of the network. The second approach begins with an oversized network and then prunes redundant hidden units.

Much research has been done on algorithms that start with a minimal network and dynamically construct the final network. These algorithms include dynamic node creation (Ash 1989), the cascade correlation
algorithm (Fahlman and Lebiere 1990), the tiling algorithm (Mezard and Nadal 1989), the self-organizing neural network (Tenorio and Lee 1990), and the upstart algorithm (Frean 1990). The aim of these algorithms is to find a network with the simplest architecture possible that is capable of correctly classifying all its input patterns.

Interest in algorithms that remove hidden units from an oversized network has also been growing (Chauvin 1989; Chung and Lee 1992; Mozer and Smolensky 1989; Hanson and Pratt 1989). Instead of removing unnecessary hidden units, individual weights (connections) in the network may also be removed to increase the generalization capability of a neural network (Le Cun et al. 1990; Hassibi and Stork 1993; Thodberg 1991). Typically, methods for removing weights from the network involve adding a penalty term to the error function (Hertz et al. 1991). It is hoped that by adding a penalty term to the error function, unnecessary connections will have small weights, and therefore the complexity of the network can be reduced significantly by pruning. The simplest and most commonly used penalty term is the sum of the squared weights.

After the training process has been completed, that is, it has converged to a local minimum point, a measure of the saliency of each connection in the network is computed. The saliency measure gives an indication of the expected increase in the error function after the corresponding connection is eliminated from the network. In the pruning methods optimal brain damage (Le Cun et al. 1990) and optimal brain surgeon (Hassibi and Stork 1993), the saliency of each connection is computed using a second-order approximation of the error function near a local minimum. If the saliency of a connection is below a certain threshold, the connection is removed from the network. The resulting network needs to be retrained if the increase in the error function is larger than a predetermined acceptable increase.

In this article, a simple scheme is proposed for removing redundant weights from a network that has been trained to minimize a penalty function. The effectiveness of the proposed pruning scheme is tested on several well-known problems: the contiguity problem, the monks problems, and the parity problems. In many cases, the resulting networks contain fewer connections than previously reported in the literature.

The organization of this article is as follows. In Section 2, we describe the penalty function and network pruning scheme. Experimental results are reported in Section 3. Finally, in Section 4, a brief discussion and final remarks are given.
2 Neural Network Training and Pruning

2.1 Neural Network Training. Let us consider the fully connected three-layer network depicted in Figure 1. The error function associated with this
Figure 1: Fully connected feedforward neural network with three hidden units and three output units.
network is normally defined as the sum of the squared errors:

$$E(w, v) = \frac{1}{2}\sum_{i=1}^{k}\sum_{p=1}^{C} (S_{pi} - t_{pi})^2, \tag{2.1}$$

where:

• k is the number of patterns.
• C is the number of output units.
• $t_{pi} = 0$ or 1 is the target value for pattern $x^i$ at output unit p, p = 1, 2, ..., C.
• $S_{pi}$ is the output of the network at output unit p,

$$S_{pi} = f\left(\sum_{j=1}^{h} f\!\left((x^i)^T w^j\right) v_{pj}\right) \tag{2.2}$$
• h is the number of hidden units in the network.
• $x^i$ is an N-dimensional input pattern, i = 1, 2, ..., k.
• $w^m$ is an N-dimensional vector of weights for the arcs connecting the input layer and the mth hidden unit, m = 1, 2, ..., h. The weight of the connection from the ℓth input unit to the mth hidden unit is denoted by $w_{m\ell}$.
• $v^m$ is a C-dimensional vector of weights for the arcs connecting the mth hidden unit and the output layer. The weight of the connection from the mth hidden unit to the pth output unit is denoted by $v_{pm}$.

The activation function f(y) is usually the sigmoid function, $\sigma(y) = 1/(1 + e^{-y})$, or the hyperbolic tangent function, $\delta(y) = (e^y - e^{-y})/(e^y + e^{-y})$. The main difference between these two functions is their range: the range of $\sigma(y)$ is [0, 1], while the range of $\delta(y)$ is [−1, 1]. Since we will be using the cross-entropy error function, which involves taking the logarithm of the predicted network outputs, the sigmoid function is applied at the output units. Karnin (1990) used the hyperbolic tangent function as the activation function at the hidden units; we have also used this function to obtain all the results reported in this paper.

It has been suggested by several authors (Lang and Witbrock 1988; van Ooyen and Nienhuis 1992) that the cross-entropy error function

$$F(w, v) = -\sum_{i=1}^{k}\sum_{p=1}^{C} \left[ t_{pi}\log S_{pi} + (1 - t_{pi})\log(1 - S_{pi}) \right] \tag{2.3}$$

improves the convergence of the training process. The components of the gradient of this function are

$$\frac{\partial F(w, v)}{\partial w_{m\ell}} = \frac{\partial F(w, v)}{\partial S_{pi}} \times \frac{\partial S_{pi}}{\partial w_{m\ell}} = \sum_{i=1}^{k}\sum_{p=1}^{C} \left[ e_{pi} \times v_{pm} \times \left(1 - \delta(x^i w^m)^2\right) \times x^i_\ell \right]$$

$$\frac{\partial F(w, v)}{\partial v_{pm}} = \frac{\partial F(w, v)}{\partial S_{pi}} \times \frac{\partial S_{pi}}{\partial v_{pm}} = \sum_{i=1}^{k} \left[ e_{pi} \times \delta(x^i w^m) \right]$$
for all m = 1, 2, ..., h, ℓ = 1, 2, ..., n, and p = 1, 2, ..., C, with the error for each pattern at output unit p defined as $e_{pi} = S_{pi} - t_{pi}$, i = 1, 2, ..., k.

The number of output units C depends on the number of different classes in the data. When there are only two classes, a single output unit is sufficient: each pattern can be given a target value equal to 0 or 1 depending on its class label. When there are more than two classes, C is set to the number of classes, and each pattern $x^i$ in class $C_c$, c = 1, 2, ..., C, is given a C-dimensional target vector $t^i$ such that $t_{pi} = 1$ if p = c and $t_{pi} = 0$ otherwise.

The difference between the gradients of the functions E(w, v) and F(w, v) is the presence of the product $S_{pi}(1 - S_{pi})$ in the gradient of the sum of the squared errors E(w, v). As pointed out in van Ooyen and Nienhuis (1992), when the values of $S_{pi}$ converge to either 0 or 1, the magnitude of the gradient of E(w, v) becomes small. This leads to a slowdown in the convergence of any minimization algorithm that utilizes gradient information to move from one iterate to the next.

2.2 Choice of Penalty Function and Criteria for Weight Elimination. When a network is to be pruned, it is common practice to add a penalty term to the error function during training. Usually the penalty term is the sum of the squared weights,

$$P_1(w, v) = \frac{\epsilon}{2}\left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} w_{m\ell}^2 + \sum_{m=1}^{h}\sum_{p=1}^{C} v_{pm}^2\right), \tag{2.4}$$
where $\epsilon$ is a small positive weight decay constant (Hinton 1989). This quadratic penalty term is used to discourage the weights from taking large values. If second-order approximations of the error function are employed to find a local minimum of the error function, the addition of this quadratic term also contributes to the stability of the training process: the penalty term 2.4 adds $\epsilon$ to the diagonal of the second derivative matrix of the error function. With this perturbation, it is more likely for the matrix to be positive definite, and hence a descent direction can be obtained.

There are some drawbacks to using this penalty term. When the backpropagation method is used to train the neural network, the addition of this quadratic penalty term will cause all the weights to decay exponentially to zero at the same rate (Hanson and Pratt 1989). The quadratic penalty also disproportionately penalizes large weights. To remedy this problem, the following penalty function has been suggested (Weigend et al. 1990):

$$P_2(w, v) = \frac{\epsilon}{2}\left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} \frac{w_{m\ell}^2}{1 + w_{m\ell}^2} + \sum_{m=1}^{h}\sum_{p=1}^{C} \frac{v_{pm}^2}{1 + v_{pm}^2}\right). \tag{2.5}$$
Since the value of the function $f(w) = w^2/(1 + w^2)$ is small when w is close to zero and approaches 1 as w becomes large, the function 2.5 can be considered a measure of the total number of nonzero weights in the network. The derivative $f'(w) = 2w/(1 + w^2)^2$ indicates that, for weights having large values, backpropagation training will be affected very little by the addition of the penalty function 2.5. Selecting suitable values for the backpropagation learning rate and $\epsilon$ will cause small weights to decay at a higher rate than large weights. A disadvantage of this penalty function is that it does not make any distinction between large and very large weights; that is, there exists a sufficiently large $w_1$ such that for all $|w| \ge w_1$, $1 - \nu \le f(w) \le 1$ for any small positive scalar $\nu$.

The reason we would like to have weights that are not too large is linked to the criteria for weight elimination developed below. Recall that given an input pattern $x^i$, the output of the network at unit p is given according to the equation

$$S_{pi} = \sigma\left(\sum_{j=1}^{h} \delta\!\left((x^i)^T w^j\right) v_{pj}\right). \tag{2.6}$$
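For concreteness, a minimal sketch (ours, not code from the article) of the forward pass in equation 2.6, with tanh hidden units and a sigmoid output layer; the array shapes are assumptions of this illustration.

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def network_output(x, W, V):
    """Equation 2.6: S_p = sigma( sum_j delta(x^T w^j) v_pj ).

    x : (N,) input pattern
    W : (h, N) rows are the hidden-unit weight vectors w^j
    V : (C, h) V[p, j] = v_pj
    """
    hidden = np.tanh(W @ x)        # delta(.) is the hyperbolic tangent
    return sigmoid(V @ hidden)     # sigmoid applied at the output units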
The derivatives of $S_{pi}$ with respect to the weights of the network are as follows:

$$\frac{\partial S_{pi}}{\partial w_{m\ell}} = S_{pi}(1 - S_{pi})\, v_{pm}\, x^i_\ell\, \left(1 - \delta(x^i w^m)^2\right) \tag{2.7}$$

$$\frac{\partial S_{pi}}{\partial v_{qm}} = \begin{cases} S_{pi}(1 - S_{pi})\, \delta(x^i w^m) & \text{if } p = q \\ 0 & \text{otherwise.} \end{cases} \tag{2.8}$$
For the moment, let us consider $S_{pi}$ as a function of a single variable corresponding to the connection between the ℓth input unit and the mth hidden unit. From the mean value theorem (Grossman 1995, p. 173), we have that

$$S_{pi}(w) = S_{pi}(w_{m\ell}) + \frac{\partial S_{pi}\!\left(w_{m\ell} + \lambda(w - w_{m\ell})\right)}{\partial w_{m\ell}}\,(w - w_{m\ell}), \tag{2.9}$$
where $0 < \lambda < 1$. The maximum value of the product $S_{pi}(1 - S_{pi})$ is 0.25, and the maximum value of $(1 - \delta(x^i w^m)^2)$ is 1. Assuming that $x^i_\ell \in [0, 1]$, from equation 2.7 we obtain the bound

$$\left|\frac{\partial S_{pi}(w)}{\partial w_{m\ell}}\right| \le |v_{pm}|/4, \quad \forall w \in \mathbb{R}. \tag{2.10}$$

Combining equations 2.9 and 2.10 and setting w equal to zero, we obtain

$$\left|S_{pi}(0) - S_{pi}(w_{m\ell})\right| \le |v_{pm} w_{m\ell}|/4. \tag{2.11}$$
Equation 2.11 gives an upper bound on the change in the output of the network when the weight $w_{m\ell}$ is eliminated. Similarly, by considering $S_{pi}$ as a function of the single variable $v_{pm}$ that corresponds to the connection between the mth hidden unit and the pth output unit, from equation 2.8 we have the bound

$$\left|\frac{\partial S_{pi}(v)}{\partial v_{pm}}\right| \le 1/4, \quad \forall v \in \mathbb{R}. \tag{2.12}$$

Hence, the change in the pth output of the network after the weight $v_{pm}$ has been eliminated is bounded by

$$\left|S_{pi}(0) - S_{pi}(v_{pm})\right| \le |v_{pm}|/4. \tag{2.13}$$
A pattern is correctly classified if the following condition is satisfied:

$$|e_{pi}| = |S_{pi} - t_{pi}| \le \eta_1, \quad p = 1, 2, \ldots, C, \tag{2.14}$$
where $\eta_1 \in [0, 0.5)$. Suppose now that with the original fully connected network, we are able to meet a prespecified accuracy requirement. The bound in equation 2.11 shows that we can set $w_{m\ell}$ to zero without deteriorating the classification rate of the network if the product $|v_{pm} w_{m\ell}|$ is sufficiently small. Similarly, the bound in equation 2.13 suggests that $v_{pm}$ can be set to zero if its magnitude is sufficiently small. We have the following two propositions.

Proposition 1. Let (w, v) be a set of weights of a network with an accuracy rate of R, with the condition in equation 2.14 satisfied by each correctly classified input pattern. Let $\max_p |v_{pm} w_{m\ell}| \le 4\eta_2$, and let $\tilde{S}_{pi}$ be the output of the network with $w_{m\ell}$ set to zero; then for each pattern $x^i$ that is correctly classified by the original network, the following holds: $|\tilde{e}_{pi}| = |\tilde{S}_{pi} - t_{pi}| \le \eta_1 + \eta_2$, p = 1, 2, ..., C.

Proposition 2. Let (w, v) be a set of weights of a network with an accuracy rate of R, with the condition in equation 2.14 satisfied by each correctly classified input pattern. Let $|v_{pm}| \le 4\eta_2$, and let $\tilde{S}_{pi}$ be the output of the network with $v_{pm}$ set to zero. Then for each pattern $x^i$ that is correctly classified by the original network, the following holds: $|\tilde{e}_{pi}| = |\tilde{S}_{pi} - t_{pi}| \le \eta_1 + \eta_2$, p = 1, 2, ..., C.

As long as the sum $\eta_1 + \eta_2$ is less than 0.5, Propositions 1 and 2 ensure that the pruned network still achieves an accuracy rate of R. After $w_{m\ell}$ or $v_{pm}$ is set to zero, the absolute errors of each correctly classified pattern can
be made arbitrarily small by increasing the weights from the hidden units to the output units, as Proposition 3 suggests.

Proposition 3. Let (w, v) be a set of weights of a network with an accuracy rate of R such that for each correctly classified pattern $x^i$, the following condition holds: $|e_{pi}| = |S_{pi} - t_{pi}| \le \eta$, p = 1, 2, ..., C, for some $\eta \in [0, 0.5)$. For any $\rho \in (0, 1)$, there exists a sufficiently large scalar $\lambda > 0$ such that $|\hat{e}_{pi}| = |\hat{S}_{pi} - t_{pi}| \le \rho\eta$, where $\hat{S}_{pi}$ is the output of a new network with weights $(w, \lambda v)$.

Proposition 3 guarantees the existence of a new set of weights that still gives an accuracy rate of R with the same error tolerance in equation 2.14. It will be prudent to retrain the network to find this new set, so that the weights v of the connections between the hidden units and output units do not become too large: simply multiplying v by some large $\lambda$ would hinder the process of removing subsequent weights from the network. We summarize our pruning algorithm below.

Algorithm N2P2F: Neural Network Pruning with Penalty Function

1. Let $\eta_1$ and $\eta_2$ be positive scalars such that $\eta_1 + \eta_2 < 0.5$. ($\eta_1$ is the error tolerance; $\eta_2$ is a threshold that determines if a weight can be removed.)

2. Pick a fully connected network, and train this network to meet a prespecified accuracy level with the condition in equation 2.14 satisfied by all correctly classified input patterns. Let (w, v) be the weights of this network.

3. For each $w_{m\ell}$ in the network, if

$$\max_p |v_{pm} w_{m\ell}| \le 4\eta_2, \tag{2.15}$$

then remove $w_{m\ell}$ from the network.

4. For each $v_{pm}$ in the network, if

$$|v_{pm}| \le 4\eta_2, \tag{2.16}$$

then remove $v_{pm}$ from the network.
5. If no weight satisfies the condition in equation 2.15 or 2.16, then for each $w_{m\ell}$ in the network, compute $\omega_{m\ell} = \max_p |v_{pm} w_{m\ell}|$, and remove the $w_{m\ell}$ with the smallest $\omega_{m\ell}$.

6. Retrain the network. If the classification rate of the network falls below the specified level, then stop and use the previous setting of the network weights. Otherwise, go to step 3.

To reduce the retraining time, we remove all the weights that satisfy the conditions in equations 2.15 or 2.16 in steps 3 and 4 of the algorithm. Although it can no longer be guaranteed that the accuracy of the retrained network will not drop, our experiments show that many weights can be eliminated simultaneously without deteriorating the performance of the network. This is especially true when the starting fully connected network has an excessive number of redundant hidden units.

Because the two conditions in equations 2.15 and 2.16 for pruning depend on the magnitude of the weights for connections between the input units and the hidden units (see equation 2.15) and between the hidden units and the output units (see equation 2.16), it is imperative that during training these weights be prevented from getting too large. At the same time, small weights should be encouraged to decay rapidly to zero. To achieve these goals, we make use of both penalty functions in equations 2.4 and 2.5 to obtain the following function:
$$P(w, v) = \epsilon_1 \left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} \frac{\beta w_{m\ell}^2}{1 + \beta w_{m\ell}^2} + \sum_{m=1}^{h}\sum_{p=1}^{C} \frac{\beta v_{pm}^2}{1 + \beta v_{pm}^2}\right) + \epsilon_2 \left(\sum_{m=1}^{h}\sum_{\ell=1}^{n} w_{m\ell}^2 + \sum_{m=1}^{h}\sum_{p=1}^{C} v_{pm}^2\right). \tag{2.17}$$
Hence, the complete error function to be minimized during the training process is

$$\theta(w, v) = P(w, v) - \sum_{i=1}^{k}\sum_{p=1}^{C} \left(t_{pi}\log S_{pi} + (1 - t_{pi})\log(1 - S_{pi})\right). \tag{2.18}$$
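To tie equations 2.15 through 2.17 together, here is a compact sketch (ours, not the author's implementation) of the penalty term and of steps 3 through 5 of algorithm N2P2F; the retraining in step 6 is left to whatever routine minimizes $\theta(w, v)$, and the default parameter values follow the experiments reported below.

import numpy as np

def penalty(W, V, eps1=0.1, eps2=1e-5, beta=10.0):
    """Equation 2.17 for the weight matrices W (input-to-hidden) and V."""
    def term(M):
        return (eps1 * np.sum(beta * M**2 / (1.0 + beta * M**2))
                + eps2 * np.sum(M**2))
    return term(W) + term(V)

def prune_step(W, V, eta2=0.10):
    """Steps 3-5 of N2P2F: zero prunable weights in place, return count removed.

    W : (h, n) with W[m, l] = w_{ml};  V : (C, h) with V[p, m] = v_{pm}
    """
    omega = np.max(np.abs(V), axis=0)[:, None] * np.abs(W)   # max_p |v_pm w_ml|
    removable_w = (omega <= 4.0 * eta2) & (W != 0.0)         # equation 2.15
    removable_v = (np.abs(V) <= 4.0 * eta2) & (V != 0.0)     # equation 2.16
    removed = int(removable_w.sum() + removable_v.sum())
    if removed == 0:                                         # step 5: force one removal
        alive = np.where(W != 0.0, omega, np.inf)
        W[np.unravel_index(np.argmin(alive), W.shape)] = 0.0
        return 1
    W[removable_w] = 0.0
    V[removable_v] = 0.0
    return removed

In use, one would alternate prune_step with retraining until the classification rate falls below the required level, keeping the last network that met it.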
The values of the weight decay parameters $\epsilon_1, \epsilon_2 > 0$ must be chosen to reflect the relative importance of the accuracy of the network versus its complexity. These parameters also determine the range of values where the penalty for each weight in the network is approximately equal to $\epsilon_1$.
Figure 2: Plots of the function $f(w) = \epsilon_1 \beta w^2/(1 + \beta w^2) + \epsilon_2 w^2$ and its derivative. (a, b): $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-5}$, and $\beta = 10$. (c, d): $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-6}$, and $\beta = 10$.
Consider the plot of the function $f(w) = \epsilon_1 \beta w^2/(1 + \beta w^2) + \epsilon_2 w^2$, where $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-5}$, and $\beta = 10$, in Figure 2a. This function intersects the horizontal line $f = \epsilon_1$ at $w^* \approx \pm 5.62$. If $\nu$ is a small positive scalar, then we can find values $w_L > 0$ and $w_U > 0$ such that for all w with $|w| \in (w_L, w_U)$, $\epsilon_1 - \nu \le f(w) \le \epsilon_1 + \nu$. This is the penalty corresponding to a nonzero connection in the network. In this interval, almost no distinction is made between the penalty for smaller or larger weights. For example, when $\nu = 10^{-2}$, the penalty for any weight with magnitude in the interval $(w_L, w_U) = (0.95, 31.64)$ is within 10% of $\epsilon_1$. By decreasing the value of $\epsilon_2$, the interval over which the penalty value is approximately equal to $\epsilon_1$ can be made wider, as shown in Figure 2c, where $\epsilon_2 = 10^{-6}$. A weight is prevented from taking a value that is too large, since the quadratic term becomes dominant for values of w greater than $w_U$. The value of
$w_L$ can be made arbitrarily small by increasing the value of the parameter $\beta$. The derivative of the function f(w) near zero is relatively large (Figs. 2b and 2d). This gives a small weight w a stronger tendency to decay to zero.

3 Experimental Results

We selected well-known problems to test the pruning algorithm described in the previous section: the contiguity problem, the 4-bit and 5-bit parity problems, and the monks problems. Since we were interested in finding the simplest network architecture that could solve these problems, all available input data were used for training and pruning. An exception was the monks problems: for the three monks problems, the division of the data sets into training and testing sets has been fixed (Thrun et al. 1991). There was one output unit in the networks.

For all problems, thresholds or biases in the hidden units were incorporated into the network by appending a 1 to each input pattern. The threshold value of the output unit was set to 0 to simplify the implementation of the training and pruning algorithms. Only two different sets of parameters were involved in the error function; they corresponded to the weights of the connections between the input layer and the hidden layer, and between the hidden layer and the output layer. During training, the gradient of the error function was computed by taking partial derivatives of the function with respect to these parameters only. During pruning, conditions were checked to determine whether a connection could be removed. There was no special condition for checking whether a hidden unit bias could be removed. Note that if, after pruning, there exists a hidden unit connected only to the output unit and to the input unit with a constant input value of 1 for all patterns, then the weights of the two arcs connected to this hidden unit determine a nonzero threshold value at the output unit.

The function $\theta(w, v)$ (cf. equation 2.18) was minimized by a variant of the quasi-Newton algorithm, the BFGS method. It has been shown that quasi-Newton algorithms, such as the BFGS method, can speed up the training process significantly (Watrous 1987). At each iteration of the algorithm, a positive definite matrix that is an approximation of the inverse of the Hessian of the function to be minimized is computed. The positive definiteness of this matrix ensures that a descent direction can be generated. Given a descent direction, a step size is computed via an inexact line search algorithm. Details of this algorithm can be found in Dennis and Schnabel (1983).

For all the experiments reported here, we used the same values for the parameters involved in the function $\theta(w, v)$: $\epsilon_1 = 0.1$, $\epsilon_2 = 10^{-5}$, and $\beta = 10$. During the training process, the BFGS algorithm was terminated when the following condition was satisfied: $\|\nabla\theta(w, v)\| \le 10^{-8} \max\{1, \|(w, v)\|\}$, where $\|\nabla\theta(w, v)\|$ is the 2-norm of the gradient of $\theta(w, v)$. At this point the
accuracy of the network was checked. The value of $\eta_1$ (cf. equation 2.14) was 0.35. If the classification rate of the network met the prespecified required accuracy, then the pruning process was initiated. The required accuracy of the original fully connected network was 100% for all problems, except for the monks 3 problem, for which the acceptable accuracy rate was 95%. We did not attempt to get a 100% accuracy rate for this problem due to the presence of six incorrect classifications (out of 122 patterns) in the training set.

The parameter $\eta_2$ used during the pruning process was set at 0.10. Typically, at the start of the pruning process, many weights would be eliminated because they satisfied the condition in equation 2.15. Subsequently, weights were removed one at a time based on the magnitude of the product $|v_{pm} w_{m\ell}|$. Before the actual pruning and retraining was done, the current network was saved. Should pruning additional weight(s) and retraining the network fail to give a new set of weights that met the prespecified accuracy requirement, the saved network would be considered the smallest network for this particular run of the experiment.

3.1 The Contiguity Problem. This problem has been the subject of several studies on network pruning (Thodberg 1991; Gorodkin et al. 1993). The input patterns were binary strings of length 10. Of all possible 1024 patterns, only those with two or three clumps were used. A clump was defined as a block of 1's in the string separated by at least one 0. The target value $t^i$ for each of the 360 patterns with two clumps was 0, and the target value for each of the 432 patterns with three clumps was 1. A sparse symmetric network architecture with nine hidden units and a total of 37 weights has been suggested for this problem (Gorodkin et al. 1993).

Two experiments were carried out using this data set as input. In the first experiment, 50 neural networks having six hidden units were used as the starting networks. In the second experiment, the same number of networks, each with nine hidden units, were used. The accuracy rate of all these networks, which were trained starting from different random initial weights, was 100%. The number of connections and hidden units left in the networks after pruning are tabulated in Table 1.

Thodberg (1991) trained 38 fully connected networks with 15 hidden units using 100 patterns as training data. The trained networks were then pruned by a brute-force scheme, and the average number of connections left in the pruned networks was 32; this number, however, did not include the biases in the networks. Employing the optimal brain damage pruning scheme, Gorodkin et al. (1993) obtained pruned networks with numbers of weights ranging from 36 to more than 90. These networks were trained on 140 examples. In contrast, the maximum number of connections left in our networks was 33, and only one of the pruned networks had this many connections. The results of the experiments with different initial numbers of hidden units in the networks show the robustness of our pruning algorithm.
Table 1: Average Number of Connections and Hidden Units in 50 Networks Trained to Solve the Contiguity Problem after Pruning.

Number of starting hidden units                6             9
Number of connections before pruning          72            108
Average number of connections after pruning   28.44 (1.33)  29.18 (1.32)
Average number of hidden units after pruning  5.98 (0.14)   7.62 (0.75)

Note: Figures in parentheses indicate standard deviations.
Starting with six or nine hidden units, the networks were pruned until there were on average 29 connections left.

3.2 The Parity Problem. The parity problem is a well-known difficult problem that has often been used for testing the performance of a neural network training algorithm. The input set consists of $2^N$ patterns in N-dimensional space, and each pattern is an N-bit binary vector. The target value $t^i$ is equal to 1 if the number of 1's in the pattern is odd, and it is 0 otherwise. We used the 4-bit and 5-bit parity problems to test our pruning algorithm.

For the 4-bit parity problem, three sets of experiments were conducted. In the first set, 50 networks, each with four hidden units, were used as the starting networks. In the second and third sets of experiments, the initial numbers of hidden units were five and six. Similarly, three sets of experiments were conducted for the 5-bit parity problem; the initial number of hidden units in the 50 starting networks for each set of experiments was five, six, and seven, respectively. The average numbers of weights and hidden units of the networks after pruning are shown in Table 2.

Many experiments using backpropagation networks assume that N hidden units are needed to solve the N-bit parity problem. Previously published neural network pruning algorithms have also failed to obtain networks with fewer than N hidden units (Chung and Lee 1992; Hanson and Pratt 1989). In fact, three hidden units are sufficient for both the 4-bit and 5-bit parity problems.

3.3 The Monks Problems. The monks problems (Thrun et al. 1991) are an artificial robot domain, in which robots are described by six different attributes:

A1: head shape ∈ {round, square, octagon};
A2: body shape ∈ {round, square, octagon};
Table 2: Average Number of Connections and Hidden Units Obtained from 50 Networks for the 4-bit and 5-bit Parity Problems after Pruning.

Parity 4
Number of starting hidden units                      4             5             6
Number of connections before pruning                24            30            36
Average number of connections after pruning     17.40 (1.55)  18.12 (1.60)  18.76 (2.30)
Average number of hidden units after pruning     3.42 (0.50)   3.64 (0.66)   3.98 (0.77)

Parity 5
Number of starting hidden units                      5             6             7
Number of connections before pruning                35            42            49
Average number of connections after pruning     21.80 (3.14)  21.84 (3.02)  22.34 (3.17)
Average number of hidden units after pruning     3.58 (0.81)   3.68 (0.84)   3.82 (0.83)
3.3 The Monks Problems. The monks problems (Thrun et al. 1991) are set in an artificial robot domain, in which robots are described by six attributes:

A1: head shape ∈ {round, square, octagon};
A2: body shape ∈ {round, square, octagon};
A3: is smiling ∈ {yes, no};
A4: holding ∈ {sword, balloon, flag};
A5: jacket color ∈ {red, yellow, green, blue};
A6: has tie ∈ {yes, no}.

The learning tasks of the three monks problems are binary classifications, each given by the following logical description of a class:

• Monks 1 problem: (head shape = body shape) or (jacket color = red). From the 432 possible samples, 124 were randomly selected for the training set.

• Monks 2 problem: Exactly two of the six attributes have their first value. From the 432 samples, 169 were selected randomly for the training set.

• Monks 3 problem: (Jacket color is green and holding a sword) or (jacket color is not blue and body shape is not octagon). From the 432 samples, 122 were selected randomly for training, and among them there were 5% misclassifications (i.e., noise in the training set).
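The three target concepts can be transcribed directly as predicates over the six attributes. The sketch below (our encoding, not from the paper) also shows how the 3 × 3 × 2 × 3 × 4 × 2 = 432 samples arise:

```python
from itertools import product

HEAD_SHAPE = BODY_SHAPE = ("round", "square", "octagon")
SMILING = HAS_TIE = ("yes", "no")
HOLDING = ("sword", "balloon", "flag")
JACKET = ("red", "yellow", "green", "blue")
DOMAINS = (HEAD_SHAPE, BODY_SHAPE, SMILING, HOLDING, JACKET, HAS_TIE)

def monks1(a):
    return a[0] == a[1] or a[4] == "red"

def monks2(a):
    # exactly two of the six attributes take their first value
    return sum(x == d[0] for x, d in zip(a, DOMAINS)) == 2

def monks3(a):
    return (a[4] == "green" and a[3] == "sword") or \
           (a[4] != "blue" and a[1] != "octagon")

samples = list(product(*DOMAINS))
assert len(samples) == 432
```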
The testing set for the three problems consisted of all 432 samples. Two sets of experiments were conducted on the monks 1 and monks 2 problems: in the first set, 50 networks, each with three hidden units, were used as the starting networks for pruning; in the second set, an equal number of networks with five hidden units were used. These networks correctly classified all patterns in the training set. For the monks 3 problem, in addition to these two sets of experiments, 50 networks with just one hidden unit were also trained as starting networks. All starting networks for monks 3 had an accuracy rate of at least 95% on the training data. The required accuracy of the pruned networks on the training patterns was 100% for the monks 1 and monks 2 problems and 95% for the monks 3 problem.

The results of the experiments for the monks 1 and monks 2 problems are summarized in Table 3; those for the monks 3 problem are tabulated in Table 4. The P-values in these tables were computed to test whether the accuracy of the networks on the testing set increased significantly after pruning. With the exception of the networks with one hidden unit for the monks 3 problem, the P-values clearly show that pruning did increase the predictive accuracy of the networks.

There was not much difference in the average number of connections left after pruning between networks that originally had three hidden units and those with five hidden units. For the monks 3 problem, when the starting networks had only one hidden unit, the average number of connections left was 9.92, significantly fewer than the averages obtained from networks with three and five starting hidden units. Part of this difference can be accounted for by the number of hidden units still left in the pruned networks: most of the networks with three or five starting hidden units still had three hidden units after pruning, and the connections from these hidden units to the output unit added on average two more connections to the total.

The smallest pruned networks with 100% accuracy on both the training and the testing sets for the monks 1 and monks 2 problems had 11 and 15 connections, respectively. For the monks 3 problem, the pruning algorithm found a network with only 6 connections. This network was able to identify the six noise cases in the training set: its predicted outputs for these noise patterns were the opposite of the target outputs, and hence an accuracy rate of 95.08% was obtained on the training set. Since there was no noise in the testing data, the predictive accuracy rate was 100%. The network is depicted in Figure 3. For comparison, the backpropagation with weight decay and the cascade-correlation algorithms both achieved 97.2% accuracy on the testing set (Thrun et al. 1991).

Some interesting observations on the relationship between the number of hidden units and the generalization capability of a network were made by Weigend (1993). One of these observations is that large networks perform better than smaller networks, provided that the training of the networks is stopped early. To determine when training should be terminated, a second data set, the cross-validation set, is used.
Table 3: Average Number of Connections and Predictive Accuracy Obtained from 50 Networks for the Monks 1 and Monks 2 Problems after Pruning.

Monks 1
Number of starting hidden units                      3              5
Number of connections before pruning                57             95
Average accuracy on testing set (%)             99.10 (2.66)   99.52 (1.44)
Average number of connections after pruning     12.96 (1.63)   12.54 (1.05)
Average accuracy on testing set (%)             99.82 (0.74)  100.00 (0.00)
P-value                                             0.033          0.009

Monks 2
Number of starting hidden units                      3              5
Number of connections before pruning                57             95
Average accuracy on testing set (%)             98.91 (1.47)   98.13 (2.32)
Average number of connections after pruning     16.76 (1.81)   16.86 (1.40)
Average accuracy on testing set (%)             99.44 (0.78)   99.39 (0.70)
P-value                                             0.012          0.001

Note: P-values are computed for testing if the predictive accuracy rates of the networks increase significantly after pruning.
Figure 3: A network with six weights for the monks 3 problem. The number next to a connection shows the weight for that connection. The accuracy rates on the training set and the testing set are 95.08% and 100%, respectively.
Table 4: Average Number of Connections and Predictive Accuracy Obtained from 50 Networks for the Monks 3 Problem after Pruning.

Monks 3
Number of starting hidden units                      1             3             5
Number of connections before pruning                19            57            95
Average accuracy on training set (%)            95.85 (0.69)  99.02 (1.09)  99.92 (0.38)
Average accuracy on testing set (%)             96.07 (2.43)  93.43 (1.91)  92.82 (1.53)
Average number of connections after pruning      9.92 (2.06)  14.64 (2.79)  14.86 (2.19)
Average accuracy on training set (%)            95.82 (0.69)  96.00 (1.07)  96.46 (1.12)
Average accuracy on testing set (%)             96.11 (2.55)  94.12 (2.95)  93.85 (2.21)
P-value                                             0.468         0.082         0.003

Note: P-values are computed for testing if the predictive accuracy rates of the networks increase significantly after pruning.
Our results on the monks problems are mixed. For the monks 1 problem, networks with five hidden units have higher predictive accuracy than those with three hidden units. For the monks 2 problem, the reverse is true. For the monks 3 problem, where there is noise in the data, there is a trend for larger networks to have poorer predictive accuracy rates than smaller ones, even after the networks have been pruned. Note that these statistics were obtained from networks that had been trained to reach a local minimum of the augmented error function. One clear conclusion that can be drawn from the data in Tables 3 and 4 is that pruned networks have better generalization capability than fully connected networks.

4 Discussion and Final Remarks

We presented a penalty-function approach for feedforward neural network pruning. The penalty function consists of two components: the first discourages the use of unnecessary connections, and the second prevents the weights of these connections from taking excessively large values. Simple criteria for eliminating network connections are also given. The two components of the penalty function have been used individually in the past. However, by combining the penalty function with the magnitude-based weight-elimination criteria, we are able to obtain smaller networks than those reported in the literature.
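As an illustration of such a two-component penalty, consider the sketch below. This is ours, not the paper's exact formula (which is given in Section 2): the saturating first term and the quadratic second term are a common choice for this kind of penalty, and beta is a positive shape parameter introduced here for illustration only.

```python
import numpy as np

def penalty(weights, eps1, eps2, beta=10.0):
    """Generic two-component penalty: a saturating term that drives
    unnecessary weights toward zero plus a quadratic term that keeps
    the remaining weights from growing excessively large.
    Illustrative sketch; see Section 2 for the exact form used."""
    w2 = np.square(weights)
    return eps1 * np.sum(beta * w2 / (1.0 + beta * w2)) + eps2 * np.sum(w2)
```

During training, a penalty of this kind is added to the error function, so the gradient of the augmented error pushes small weights toward zero, where they can be removed by the elimination criteria.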
We have also experimented with pruning while one of the two penalty parameters (ε1 and ε2) was set to zero; the results were not as good as those reported in Section 3.

The approach described in this article works for networks with one hidden layer. However, it is possible to extend the algorithm to a network with N > 1 hidden layers. Label the input layer L0, the output layer LN+1, and the hidden layers L1, L2, . . . , LN; connections are present only between units in layers Li and Li+1. A criterion for removing a connection between unit H1 in layer Li and unit H2 in layer Li+1 can be developed to ensure that the changes in the predicted outputs are not more than η2. The effect of such a removal propagates layer by layer from H2 to the output units in LN+1; hence, the criterion will involve the weights from unit H2 to layer Li+2 and all the weights from layer Li+2 onward.

The effectiveness of the pruning algorithm using our proposed penalty function has been demonstrated by its performance on three well-known problems. Using the same set of penalty parameters, networks with few connections that solve the three test problems (the contiguity problem, the parity problems, and the monks problems) have been obtained. The small number of connections in the pruned networks allows us to develop an algorithm to extract compact rules from networks that have been trained to solve real-world classification problems (Setiono 1997).
References

Ash, T. 1989. Dynamic node creation in backpropagation networks. Connection Science 1(4), 365–375.
Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems, Vol. 1, pp. 519–526. Morgan Kaufmann, San Mateo, CA.
Chung, F. L., and Lee, T. 1992. A node pruning algorithm for backpropagation networks. Int. J. Neural Systems 3(3), 301–314.
Dennis, J. E., and Schnabel, R. B. 1983. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice Hall, Englewood Cliffs, NJ.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems, Vol. 2, pp. 524–532. Morgan Kaufmann, San Mateo, CA.
Frean, M. 1990. The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation 2(2), 198–209.
Gorodkin, J., Hansen, L. K., Krogh, A., and Winther, O. 1993. A quantitative study of pruning by optimal brain damage. Int. J. Neural Systems 4(2), 159–169.
Grossman, S. I. 1995. Multivariable Calculus, Linear Algebra, and Differential Equations. Saunders College Publishing, Orlando, FL.
Hanson, S. J., and Pratt, L. Y. 1989. Comparing biases for minimal network construction with back-propagation. In Advances in Neural Information Processing Systems, Vol. 1, pp. 177–185. Morgan Kaufmann, San Mateo, CA.
Hassibi, B., and Stork, D. G. 1993. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, Vol. 5, pp. 164–171. Morgan Kaufmann, San Mateo, CA.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hinton, G. E. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185–234.
Karnin, E. D. 1990. A simple procedure for pruning back-propagation trained neural networks. IEEE Trans. Neural Networks 1(2), 239–242.
Lang, K. J., and Witbrock, M. J. 1988. Learning to tell two spirals apart. In Proc. of the 1988 Connectionist Summer School, pp. 52–59. Morgan Kaufmann, San Mateo, CA.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems, Vol. 2, pp. 598–605. Morgan Kaufmann, San Mateo, CA.
Mezard, M., and Nadal, J. P. 1989. Learning in feedforward layered networks: The tiling algorithm. Journal of Physics A 22(12), 2191–2203.
Mozer, M. C., and Smolensky, P. 1989. Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, Vol. 1, pp. 107–115. Morgan Kaufmann, San Mateo, CA.
Setiono, R. 1997. Extracting rules from neural networks by pruning and hidden unit splitting. Neural Computation 9, 205–225.
Tenorio, M. F., and Lee, W. 1990. Self-organizing network for optimum supervised learning. IEEE Trans. Neural Networks 1(1), 100–110.
Thodberg, H. H. 1991. Improving generalization of neural networks through pruning. Int. J. Neural Systems 1(4), 317–326.
Thrun, S. B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Džeroski, S., Fahlman, S. E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R. S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., Van de Welde, W., Wenzel, W., Wnek, J., and Zhang, J. 1991. The MONK's problems: A performance comparison of different learning algorithms. Preprint CMU-CS-91-197, Carnegie Mellon University, Pittsburgh, PA.
van Ooyen, A., and Nienhuis, B. 1992. Improving the convergence of the backpropagation algorithm. Neural Networks 5, 465–471.
Watrous, R. 1987. Learning algorithms for connectionist networks: Applied gradient methods of nonlinear optimization. In Proc. IEEE First International Conference on Neural Networks, pp. 619–627. IEEE Press, New York.
Weigend, A. S. 1993. On overfitting and the effective number of hidden units. In Proc. of the 1993 Connectionist Models Summer School, pp. 335–342. Lawrence Erlbaum, Hillsdale, NJ.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1990. Back-propagation, weight-elimination and time series prediction. In Proc. of the 1990 Connectionist Models Summer School, pp. 105–116. Morgan Kaufmann, San Mateo, CA.
Received November 23, 1994; accepted March 13, 1996.
Communicated by Jude Shavlik
Extracting Rules from Neural Networks by Pruning and Hidden-Unit Splitting

Rudy Setiono
Department of Information Systems and Computer Science, National University of Singapore, Kent Ridge, Singapore 0511, Republic of Singapore
An algorithm for extracting rules from a standard three-layer feedforward neural network is proposed. The trained network is first pruned, not only to remove redundant connections but, more important, to detect the relevant inputs. The algorithm generates rules from the pruned network by considering only a small number of activation values at the hidden units. If the number of inputs connected to a hidden unit is sufficiently small, then rules that describe how each of its activation values is obtained can be readily generated. Otherwise the hidden unit is split and treated as a set of output units, with each output unit corresponding to an activation value; a hidden layer is inserted, and a new subnetwork is formed, trained, and pruned. This process is repeated until every hidden unit in the network has a relatively small number of input units connected to it. Examples of how the proposed algorithm works are shown using real-world data arising from molecular biology and signal processing. Our results show that for these complex problems, the algorithm can extract reasonably compact rule sets that have high predictive accuracy rates.
1 Introduction

One of the most popular applications of feedforward neural networks is distinguishing patterns in two or more disjoint sets. The neural network approach has been applied to pattern classification problems in areas as diverse as finance, engineering, and medicine. While the predictive accuracy obtained by neural networks is usually satisfactory, it is often said that a neural network is practically a black box: even for a network with only a single hidden layer, it is generally impossible to explain why a certain pattern is classified as a member of one set and another pattern as a member of another set, because of the complexity of the network.

In many applications, however, it is desirable to have rules that explicitly state under what conditions a pattern is classified as a member of one set or another. Several approaches have been developed for extracting rules
from a trained neural network. Saito and Nakano (1988) proposed a medical diagnostic expert system based on a multilayer neural network; they treated the network as a black box and used it only to observe the effects on the network output caused by changes in the inputs. Two methods for extracting rules from neural networks are described in Towell and Shavlik (1993). The first is the subset algorithm (Fu 1991), which searches for subsets of connections to a unit whose summed weight exceeds the bias of that unit. The second, the MofN algorithm, clusters the weights of a trained network into equivalence classes; the complexity of the network is reduced by eliminating unneeded clusters and by setting all weights in each remaining cluster to the average of the cluster's weights, and rules with weighted antecedents are obtained from the simplified network by translation of the hidden units and output units. Towell and Shavlik applied these two methods to knowledge-based neural networks that had been trained to recognize genes in DNA sequences; the topology and the initial weights of a knowledge-based network are determined using problem-specific a priori information.

More recently, a method that uses sampling and queries was proposed (Craven and Shavlik 1994). Instead of searching for rules in the network, the problem of rule extraction is viewed as a learning task: the target concept is the function computed by the network, and the network input features are the inputs for the learning task. Conjunctive rules are extracted from the neural network with the help of two oracles. Thrun (1995) describes a rule-extraction algorithm that analyzes the input-output behavior of a network using validity interval (VI) analysis. VI analysis divides the activation range of each network unit into intervals such that all of the network's activation values must lie within the intervals; the boundaries of these intervals are obtained by solving linear programs. Two approaches to generating rule conjectures, specific-to-general and general-to-specific, are described, and the validity of these conjectures is checked with VI analysis.

In this article, we propose an algorithm to extract rules from a pruned network. We assume the network to be a standard feedforward backpropagation network with a single hidden layer that has been trained to meet a prespecified accuracy requirement. The assumption that the network has only a single hidden layer is not restrictive: theoretical studies have shown that networks with a single hidden layer can approximate arbitrary decision boundaries (Hornik 1991), and experimental studies have also demonstrated the effectiveness of such networks (de Villiers and Barnard 1992).

The process of extracting rules from a trained network can be made much easier if the complexity of the network has first been reduced. The pruning process attempts to eliminate as many connections as possible from the network while trying to maintain the prespecified accuracy rate; it is expected that fewer connections will result in more concise rules. No initial knowledge of the problem domain is required. Relevant and irrelevant attributes
of the data are distinguished during the pruning process: those that are relevant will be kept, and the others will be discarded automatically.

A distinguishing feature of our algorithm is that the activation values at the hidden units are clustered into discrete values. When only a small number of inputs feed a hidden unit, it is not difficult to extract rules that describe how each of the discrete activation values is obtained. When many inputs are connected to a hidden unit, its discrete activation values are used as the target outputs of a new subnetwork: a new hidden layer is inserted between the inputs and this new output layer, and the new network is then trained and pruned by the same pruning technique as the original network. The process of splitting hidden units and creating new networks is repeated until each hidden unit in the network has only a small number of inputs connected to it, say, no more than five.

We describe our rule-extraction algorithm in Section 2. A network that has been trained and pruned to solve the splice-junction problem (Lapedes et al. 1989) is used to illustrate in detail how the algorithm works. Each pattern in the data set for this problem is described by a 60-nucleotide-long DNA sequence; the learning task is to recognize the type of boundary at the center of the sequence, which can be an exon-intron boundary, an intron-exon boundary, or neither. In Section 3 we present the results of applying the proposed algorithm to a second problem, the sonar target classification problem (Gorman and Sejnowski 1988). The learning task there is to distinguish between sonar returns from metal cylinders and those from cylindrically shaped rocks; each pattern in this data set is described by a set of 60 real numbers between 0 and 1. The data sets for both the splice-junction and the sonar target classification problems were obtained via ftp from the University of California–Irvine repository (Murphy and Aha 1992). Discussion of the proposed algorithm and comparison with related work are presented in Section 4. Finally, a brief conclusion is given in Section 5.
2 Extracting Rules from a Pruned Network

The algorithm for rule extraction from a pruned network consists of two main steps. The first step clusters the activation values of the hidden units into a small number of discrete values. If the number of inputs connected to a hidden unit is relatively large, the second step splits the hidden unit into a number of units, with the number of clusters found in the first step determining the number of new units. We form a new network by treating each new unit as an output unit and adding a new hidden layer; we train and prune this network and repeat the process if necessary. This process of forming a new network by treating a hidden unit as a new set of output units and adding a new hidden layer terminates when, after pruning, each hidden unit in the network has a small number of input units connected to it. The outline of the algorithm is as follows.
Rule-extraction (RX) algorithm.

1. Train and prune the neural network.
2. Discretize the activation values of the hidden units by clustering.
3. Using the discretized activation values, generate rules that describe the network outputs.
4. For each hidden unit:
   • If the number of input connections is fewer than an upper bound UC, then extract rules that describe its activation values in terms of the inputs.
   • Else form a subnetwork: (a) set the number of output units equal to the number of discrete activation values, treating each discrete activation value as a target output; (b) set the number of input units equal to the inputs connected to the hidden unit; (c) introduce a new hidden layer and apply RX to this subnetwork.
5. Generate rules that relate the inputs and the outputs by merging the rules generated in Steps 3 and 4.

In Step 3 of algorithm RX, we assume that the number of hidden units and the number of clusters in each hidden unit are relatively small. If there are H hidden units in the pruned network, and Step 2 of RX finds Ci clusters in hidden unit i, i = 1, 2, . . . , H, then there are C1 × C2 × · · · × CH possible combinations of clustered hidden-unit activation values. For each of these combinations, the network output is computed using the weights of the connections from the hidden units to the output units. Rules that describe the network outputs in terms of the clustered activation values are generated using the X2R algorithm (Liu and Tan 1995). This algorithm is designed to generate rules from small data sets with discrete attribute values. It chooses the pattern with the highest frequency, generates the shortest rule for this pattern by checking the information provided by one data attribute at a time, and repeats the process until rules that cover all the patterns are generated. The rules are then grouped according to the class labels, and redundant rules are eliminated by pruning. When there is no noise in the data, X2R generates rules with perfect accuracy on the training patterns.

If the number of inputs of a hidden unit is fewer than UC, Step 4 of the algorithm also applies X2R to generate rules that describe the discrete activation values of that hidden unit in terms of the input units. Step 5 of RX merges the two sets of rules generated by X2R: the conditions of the rules involving clustered activation values generated in Step 3 are substituted by the rules generated in Step 4.
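The recursion in Step 4 is the heart of RX. A skeleton in code (ours; train_and_prune, cluster_activations, x2r, one_hot_targets, and merge_rules are hypothetical stand-ins for the procedures described above):

```python
UC = 5  # upper bound on the number of inputs a hidden unit may have
        # before it is split (an illustrative choice)

def rx(inputs, targets):
    """Skeleton of the RX algorithm; all helpers are hypothetical."""
    net = train_and_prune(inputs, targets)                # Step 1
    clusters = {h: cluster_activations(h) for h in net.hidden_units}  # Step 2
    output_rules = x2r(clusters, net.output_weights)      # Step 3
    hidden_rules = {}
    for h in net.hidden_units:                            # Step 4
        if len(h.inputs) < UC:
            hidden_rules[h] = x2r(h.inputs, clusters[h])
        else:
            # Split: each discrete activation value becomes a target
            # output of a new subnetwork, and RX recurses on it.
            hidden_rules[h] = rx(h.inputs, one_hot_targets(clusters[h]))
    return merge_rules(output_rules, hidden_rules)        # Step 5
```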
We shall illustrate how algorithm RX works using the splice-junction domain in the next section.

2.1 The Splice-Junction Problem. We trained a network to solve the splice-junction problem, a real-world problem arising in molecular biology that has been widely used as a test data set for knowledge-based neural network training and rule-extraction algorithms (Towell and Shavlik 1993). The total number of patterns in the data set is 3175 [1], with each pattern consisting of 60 attributes. Each attribute corresponds to a DNA nucleotide and takes one of the following four values: G, T, C, or A. Each pattern is classified as IE (intron-exon boundary), EI (exon-intron boundary), or N (neither) according to the type of boundary at the center of the DNA sequence. Of the 3175 patterns, 1006 were used as training data.

The four attribute values G, T, C, and A were coded as {1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, and {0, 0, 0, 1}, respectively. Including the input for bias, 241 input units were present in the original neural network. The target output for each pattern in class IE was {1, 0, 0}, in class EI {0, 1, 0}, and in class N {0, 0, 1}. Five hidden units were sufficient to give the network more than a 95% accuracy rate on the training data. We considered a pattern of type IE to be correctly classified as long as the largest activation value was found in the first output unit; similarly, patterns of type EI and N were considered to be correctly classified if the largest activation value was obtained in the second and third output unit, respectively. The transfer functions used were the hyperbolic tangent function at the hidden layer and the sigmoid function at the output layer. The cross-entropy error measure was used as the objective function during training. To encourage weight decay, a penalty term was added to this objective function; the details of this penalty term and the pruning algorithm are described in Setiono (1997).

Aiming to achieve accuracy rates comparable to those reported in Towell and Shavlik (1993), we terminated the pruning algorithm when the accuracy of the network dropped below 92%. The resulting pruned network is depicted in Figure 1. There was a large reduction in the number of network connections: of the original 1220 weights in the network, only 16 remained. Only 10 input units (including the bias input unit) remained out of the original 241 inputs; these inputs corresponded to 7 of the 60 attributes present in the original data. Of the 5 original hidden units, 2 were also removed. As we shall see in the next subsection, the number of input units and hidden units plays a crucial role in determining the complexity of the rules extracted from the network. The accuracy of the pruned network is summarized in Table 1.
[1] There are 15 more patterns in the data set that we did not use because one or more attributes has a value that is not G, T, C, or A.
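The input and target coding described above is a standard one-hot scheme and can be written down directly. A minimal sketch (ours, not from the paper):

```python
CODE = {"G": (1, 0, 0, 0), "T": (0, 1, 0, 0), "C": (0, 0, 1, 0), "A": (0, 0, 0, 1)}
TARGET = {"IE": (1, 0, 0), "EI": (0, 1, 0), "N": (0, 0, 1)}

def encode(sequence):
    """Map a 60-nucleotide string to the 240 binary inputs
    (the constant bias input, the 241st, is omitted here)."""
    return [bit for nucleotide in sequence for bit in CODE[nucleotide]]
```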
Figure 1: Pruned network for the splice-junction problem. Following convention, the original attributes have been numbered sequentially from −30 to −1 and 1 to 30. If the expression @n = X is true, then the corresponding input value is 1; otherwise it is 0.

Table 1: Accuracy of the Pruned Network on the Training and Testing Data Sets.

                  Training data                   Testing data
Class      Errors (total)  Accuracy (%)    Errors (total)  Accuracy (%)
IE           13 (243)         94.65           32 (765)         95.82
EI           11 (228)         95.18           42 (762)         94.49
N            53 (535)         90.09          150 (1648)        90.90
Total        77 (1006)        92.35          224 (3175)        92.94
2.2 Clustering Hidden-Unit Activations. The range of the hyperbolic tangent function is the interval [−1, 1]. If this function is used as the transfer function at the hidden layer, a hidden-unit activation value may lie anywhere within this interval. However, it is possible to replace it by a discrete value without causing too much deterioration in the accuracy of the network. This is done by the following clustering algorithm.
Hidden-Unit Activation-Values Clustering Algorithm

1. Let ε ∈ (0, 1). Let D be the number of discrete activation values in the hidden unit. Let δ1 be the activation value in the hidden unit for the first pattern in the training set. Let H(1) = δ1, count(1) = 1, and sum(1) = δ1; set D = 1.

2. For each pattern pi, i = 2, 3, . . . , k, in the training set:
   • Let δ be its activation value.
   • If there exists an index j* such that

         |δ − H(j*)| = min{ |δ − H(j)| : j = 1, 2, . . . , D }   and   |δ − H(j*)| ≤ ε,

     then set count(j*) := count(j*) + 1 and sum(j*) := sum(j*) + δ;
     otherwise set D := D + 1, H(D) = δ, count(D) = 1, and sum(D) = δ.

3. Replace each H(j) by the average of all activation values that have been clustered into cluster j: H(j) := sum(j)/count(j), j = 1, 2, . . . , D.

Step 1 initializes the algorithm: the first activation value forms the first cluster. Step 2 checks whether each subsequent activation value can be placed in one of the existing clusters. The distance between the activation value under consideration and its nearest cluster, |δ − H(j*)|, is computed; if this distance is at most ε, the activation value is placed in cluster j*, and otherwise it forms a new cluster.

Once the discrete values of all hidden units have been obtained, the accuracy of the network is checked again with the activation values at the hidden units replaced by their discretized values. An activation value δ is replaced by H(j*), where the index j* is chosen such that j* = argmin_j |δ − H(j)|. If the accuracy of the network falls below the required accuracy, then ε must be decreased and the clustering algorithm run again. For a sufficiently small ε, it is always possible to match the accuracy of the network with continuous activation values, although the resulting number of distinct discrete activations can be impractically large. The best ε value is one that gives a high accuracy rate after clustering while generating as few clusters as possible. A simple way of obtaining a good value for ε is by searching in the interval (0, 1): the number of clusters and the accuracy of the network can be checked for the values ε = iξ, i = 1, 2, . . . , where ξ is a small positive scalar such as 0.10. Note also that the value of ε need not be the same for all hidden units.

For the pruned network depicted in Figure 1, we found that the value ε = 0.6 worked well for all three hidden units.
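A direct transcription of this clustering procedure in Python (a sketch; the names are ours):

```python
def cluster_activations(activations, eps):
    """One-pass clustering of a hidden unit's activation values.
    H holds the value that opened each cluster during the pass and is
    replaced by the cluster averages at the end (Step 3)."""
    H, count, total = [], [], []
    for delta in activations:
        j = min(range(len(H)), key=lambda i: abs(delta - H[i])) if H else None
        if j is not None and abs(delta - H[j]) <= eps:
            count[j] += 1              # absorb delta into its nearest cluster
            total[j] += delta
        else:                          # nearest cluster too far away: open a new one
            H.append(delta)
            count.append(1)
            total.append(delta)
    return [t / c for t, c in zip(total, count)]

def discretize(delta, centers):
    """Replace an activation value by its nearest cluster center H(j*)."""
    return min(centers, key=lambda h: abs(delta - h))
```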
Table 2: Accuracy of the Pruned Network on the Training and Testing Data Sets with Discrete Activation Values at the Hidden Units.

                  Training data                   Testing data
Class      Errors (total)  Accuracy (%)    Errors (total)  Accuracy (%)
IE           16 (243)         93.42           40 (765)         94.77
EI            8 (228)         96.49           30 (762)         96.06
N            52 (535)         90.28          142 (1648)        91.38
Total        76 (1006)        92.45          212 (3175)        93.32
The results of the clustering algorithm were as follows:

1. Hidden unit 1: There were two discrete values, −0.04 and −0.99. Of the 1006 training data, 573 patterns had the first value, and 433 patterns had the second value.
2. Hidden unit 2: There were three discrete values: 0.96, −0.10, and −0.88. The distribution of the training data was 760, 72, and 174 patterns, respectively.
3. Hidden unit 3: There were two discrete values, 0.96 and 0. Of the 1006 training data, 520 patterns had the first value, and 486 patterns had the second value.

The accuracy of the network with these discrete activation values is summarized in Table 2. With ε = 0.6, the accuracy was not the same as that achieved by the original network; in fact, it was slightly higher on both the training data and the testing data. It appears that a much smaller number of possible values for the hidden-unit activations reduces overfitting of the data. With a sufficiently small value for ε, it is always possible to maintain the accuracy of a network after its activation values have been discretized; however, there is no guarantee that the accuracy will increase in general.

Two discrete values at hidden unit 1, three at hidden unit 2, and two at hidden unit 3 produce 12 possible outputs for the network. These outputs are listed in Table 3. Let the notation α(i, j) denote hidden unit i taking its jth activation value. X2R generated the following rules that classify each of the predicted output classes:

• If α(1, 1) and α(2, 1) and α(3, 1), then output = IE.
• Else if α(2, 3), then output = EI.
• Else if α(1, 1) and α(2, 2), then output = EI.
• Else if α(1, 2) and α(2, 2) and α(3, 1), then output = EI.
Table 3: Output of the Network with Discrete Activation Values at the Hidden Units.

  Hidden unit activations         Predicted output
    1       2       3           1       2       3       Classification
  −0.04    0.96    0.96       0.44    0.01    0.20           IE
  −0.04    0.96    0.00       0.44    0.01    0.99           N
  −0.04   −0.10    0.96       0.44    0.62    0.00           EI
  −0.04   −0.10    0.00       0.44    0.62    0.42           EI
  −0.04   −0.88    0.96       0.44    0.99    0.00           EI
  −0.04   −0.88    0.00       0.44    0.99    0.02           EI
  −0.99    0.96    0.96       0.00    0.01    0.91           N
  −0.99    0.96    0.00       0.00    0.01    1.00           N
  −0.99   −0.10    0.96       0.00    0.62    0.06           EI
  −0.99   −0.10    0.00       0.00    0.62    0.97           N
  −0.99   −0.88    0.96       0.00    0.99    0.00           EI
  −0.99   −0.88    0.00       0.00    0.99    0.43           EI
• Default output: N.

Hidden unit 1 has only two inputs, one of which is the input for bias, and hidden unit 3 has only one input. Very simple rules in terms of the original attributes that describe the activation values of these hidden units were generated by X2R:

• If @-1=G, then α(1, 1); else α(1, 2).
• If @-2=A, then α(3, 1); else α(3, 2).

If the expression @n=X is true, the corresponding input value is 1; otherwise it is 0. The seven input units feeding the second hidden unit can generate 64 possible instances: the three inputs coding attribute @5 (= T, C, or A) can produce 4 different combinations, (0, 0, 0), (0, 0, 1), (0, 1, 0), or (1, 0, 0), while the other 4 inputs can generate 16 different combinations. Rules that define how each of the three activation values of this hidden unit is obtained are not trivial to extract from the network. How this difficulty can be overcome is described in the next subsection.
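Combining the output-level rules with the rules for hidden units 1 and 3 gives an almost complete classifier; only the cluster index of hidden unit 2 (written alpha2 below) remains unexplained at this point. A sketch of this partial classifier, our transcription of the rules above:

```python
def classify(window, alpha2):
    """window maps attribute positions to nucleotides, e.g. {-2: "A", -1: "G", ...};
    alpha2 is the cluster index (1, 2, or 3) of hidden unit 2's activation value."""
    a1 = 1 if window[-1] == "G" else 2   # rule for hidden unit 1
    a3 = 1 if window[-2] == "A" else 2   # rule for hidden unit 3
    if a1 == 1 and alpha2 == 1 and a3 == 1:
        return "IE"
    if alpha2 == 3 or (a1 == 1 and alpha2 == 2) \
            or (a1 == 2 and alpha2 == 2 and a3 == 1):
        return "EI"
    return "N"                           # default output
```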
2.3 Hidden-Unit Splitting and Creation of a Subnetwork. If we set the value of UC in algorithm RX to be less than 7, then in order to extract rules that describe how the three activation values of the second hidden unit are obtained, a subnetwork will be created.

Since seven inputs determined the three activation values, the new network had the same set of inputs. The number of output units corresponded to the number of activation values; in this case, three output units were needed. Each of the 760 patterns whose activation value was 0.96 was assigned a target output of {1, 0, 0}; patterns with activation values of −0.10 and −0.88 were assigned target outputs of {0, 1, 0} and {0, 0, 1}, respectively. In order to extract rules with the same accuracy rate as the network accuracy summarized in Table 2, the new network was trained to achieve a 100% accuracy rate. Five hidden units were sufficient to give the network this rate. The pruning process was terminated as soon as the accuracy dropped below 100%.

When only 7 of the original 240 inputs were selected, many patterns had duplicates. To reduce the time needed to train and prune the new network, all duplicates were removed, and the 61 unique patterns left were used as training data. Since the network was trained to achieve a 100% accuracy rate, correct predictions on these 61 patterns guaranteed that the same rate of accuracy was obtained on all 1006 patterns. The pruned network is depicted in Figure 2. Three of the five hidden units remained after pruning, and of the original 55 connections in the fully connected network, 16 were still present. The most interesting aspect of the pruned network was that the activation values of the first hidden unit were determined solely by four inputs, while those of the second hidden unit were determined by the remaining three inputs.

The activation values in the three hidden units were clustered. The value ε = 0.2 was found to be small enough for the network with discrete activation values to maintain a 100% accuracy rate and large enough that there were not too many different discrete values. The results were as follows:

1. Hidden unit 1: There were three discrete values: 0.98, 0.00, and −0.93. Of the original 1006 (61 training) patterns, 593 (39) patterns had the first value, 227 (15) the second value, and 186 (7) the third value.
2. Hidden unit 2: There were three discrete values: 0.88, 0.33, and −0.99. The distribution of the patterns was 122 (8), 174 (8), and 710 (45), respectively.
3. Hidden unit 3: There was only one discrete value: 0.98.

Since there were three different activation values at hidden units 1 and 2 and only one value at hidden unit 3, a total of nine possible outputs for the network were possible. These possible outputs are summarized in Table 4. In a similar fashion as before, let β(i, j) denote hidden unit i taking its jth activation value. The following rules that classify each of the predicted output classes were generated by X2R from the data in Table 4:

• If β(2, 3), then α(2, 1).
• Else if β(1, 1) and β(2, 2), then α(2, 1).
Figure 2: Pruned network for the second hidden unit of the original network depicted in Figure 1.
Table 4: Output of the Network with Discrete Activation Values at the Hidden Units for the Pruned Network in Figure 2.

  Hidden unit activations         Predicted output
    1       2       3           1       2       3       Classification
   0.98    0.88    0.98       0.01    0.66    0.00        α(2, 2)
   0.98    0.33    0.98       0.99    0.21    0.00        α(2, 1)
   0.98   −0.99    0.98       1.00    0.00    0.00        α(2, 1)
   0.00    0.88    0.98       0.00    0.66    0.98        α(2, 3)
   0.00    0.33    0.98       0.00    0.21    0.02        α(2, 2)
   0.00   −0.99    0.98       1.00    0.00    0.00        α(2, 1)
  −0.93    0.88    0.98       0.00    0.66    1.00        α(2, 3)
  −0.93    0.33    0.98       0.00    0.21    1.00        α(2, 3)
  −0.93   −0.99    0.98       1.00    0.00    0.00        α(2, 1)
• Else if β(1, 1) and β(2, 1), then α(2, 2).
• Else if β(1, 2) and β(2, 2), then α(2, 2).
• Default output: α(2, 3).

X2R also extracted rules, in terms of the original attributes, describing how each of the six different activation values of the two hidden units was obtained. These rules are given in Appendix A as level 3 rules. The complete rules generated for the splice-junction domain after the merging of rules in Step 5 of the RX algorithm are also given in Appendix A. Since the subnetwork created for the second hidden unit of the original network had been trained to achieve a 100% accuracy rate, and the value of ε used to cluster the activation values was also chosen to retain this level of accuracy, the accuracy rates of the rules were exactly the same as those of the network with discrete hidden-unit activation values listed in Table 2: 92.45% on the training data set and 93.32% on the test data set. The results of our experiment suggest that the extracted rules are able to mimic the network from which they were extracted perfectly. In general, if there is a sufficient number of discretized hidden-unit activation values and the subnetworks are trained and pruned to achieve a 100% accuracy rate, then the rules extracted will have the same accuracy rate as the network on the training data.

It is possible to mimic the behavior of the pruned network with discrete activation values even on future patterns not in the training data set. Let S be the set of all patterns that can be generated by the inputs connected to a hidden unit. When the discrete activation values in the hidden unit are determined by only a subset of S, it is still possible to extract rules that will cover all possible instances that may be encountered in the future: for each pattern not represented in the training data, its continuous activation values are computed using the weights of the pruned network, and each of these activation values is assigned the nearest discrete value in the hidden unit. Once the complete rules have been obtained, the network topology, the weights on the network connections, and the discrete hidden-unit activation values need not be retained. For our experiment on the splice-junction domain, we found that rules covering all possible instances were generated without having to compute the activation values of patterns not already present in the training data. This was due to the relatively large number of patterns used during training and the small number of inputs found relevant for determining the hidden-unit activation values.

3 Experiment on a Data Set with Continuous Attributes

In the previous section, we gave a detailed description of how rules can be extracted from a data set with only nominal attributes. In this section we illustrate how rules can also be extracted from a data set having continuous attributes.
Table 5: Relevant Attributes Found by ChiMerge and Their Subintervals.

Attribute   Subintervals
A35         [0, 0.1300), [0.1300, 1]
A36         [0, 0.5070), [0.5070, 1]
A44         [0, 0.4280), [0.4280, 0.7760), [0.7760, 1]
A45         [0, 0.2810), [0.2810, 1]
A47         [0, 0.0610), [0.0610, 0.0910), [0.0910, 0.1530), [0.1530, 1]
A48         [0, 0.0762), [0.0762, 1]
A49         [0, 0.0453), [0.0453, 1]
A51         [0, 0.0215), [0.0215, 1]
A52         [0, 0.0054), [0.0054, 1]
A54         [0, 0.0226), [0.0226, 1]
A55         [0, 0.0047), [0.0047, 0.0057), [0.0057, 0.0064), [0.0064, 0.0127), [0.0127, 1]
The sonar-returns classification problem (Gorman and Sejnowski 1988) was chosen for this purpose. The data set consisted of 208 sonar returns, each represented by 60 real numbers between 0.0 and 1.0. The task was to distinguish between returns from a metal cylinder and those from a cylindrically shaped rock. Returns from metal cylinders were obtained at aspect angles spanning 90 degrees, and those from rocks at aspect angles spanning 180 degrees.

Gorman and Sejnowski (1988) used three-layer feedforward neural networks with varying numbers of hidden units to solve this problem and reported two series of experiments: an aspect-angle-independent series and an aspect-angle-dependent series. In the aspect-angle-independent series, the training and testing patterns were selected randomly from the total set of 208 patterns; in the aspect-angle-dependent series, the training and testing sets were carefully selected to contain patterns from all available aspect angles. It is not surprising that the performance of networks in the aspect-angle-dependent series of experiments was better than in the aspect-angle-independent series.

We chose the aspect-angle-dependent data set to test the rule-extraction algorithm. The training set consisted of 49 sonar returns from metal cylinders and 55 sonar returns from rocks; the testing set consisted of 62 sonar returns from metal cylinders and 42 sonar returns from rocks.

In order to facilitate rule extraction, the numeric attributes of the data were first converted into discrete attributes. This was done by a modified version of the ChiMerge algorithm (Kerber 1992). The training patterns were first sorted according to the value of the attribute being discretized, and initial subintervals were formed by placing each unique value of the attribute in its own subinterval. The χ² value of each pair of adjacent subintervals was then computed, and the pair of adjacent subintervals with the lowest χ² value was merged.
In Kerber's original algorithm, merging continues until all pairs of subintervals have χ² values exceeding a user-defined threshold. Instead of setting a fixed threshold as a stopping condition, we continued merging as long as doing so introduced no inconsistency in the discretized data; by inconsistency, we mean two or more patterns from different classes being assigned identical discretized values. Using this strategy, we found that the values of many attributes were merged into just one subinterval. Attributes whose values had been merged into one subinterval could be removed without introducing any inconsistency in the discretized data set. Let us denote the original 60 attributes by A1, A2, . . . , A60. After discretization, eleven of these attributes were discretized into two or more discrete values; these attributes and the subintervals found by the ChiMerge algorithm are listed in Table 5.

The thermometer coding scheme (Smith 1993) was used for the discretized attributes. Using this coding scheme, 28 binary inputs were needed; let us denote these inputs by I1, I2, . . . , I28. All patterns with attribute A35 less than 0.1300 were coded by I1 = 0, I2 = 1, while those with attribute value greater than or equal to 0.1300 were coded by I1 = 1, I2 = 1. The other 10 attributes were coded in similar fashion.

A network with six hidden units and two output units was trained using the 104 binarized-feature training patterns. One additional input for the hidden-unit thresholds was added to the network, giving a total of 29 inputs. After training was completed, network connections were removed by pruning. The pruning process was continued until the accuracy of the network dropped below 90%, and the smallest network with an accuracy of more than 90% was saved for rule extraction. Two hidden units and 15 connections were present in the pruned network: ten inputs (I1, I3, I8, I11, I12, I14, I20, I22, I25, and I26) were connected to the first hidden unit, only one input (I17) was connected to the second hidden unit, and each of the two hidden units was connected to both output units.

The hidden-unit activation-values clustering algorithm found three discrete activation values, −0.95, 0.94, and −0.04, at the first hidden unit; the numbers of training patterns having these activation values were 58, 45, and 1, respectively. At the second hidden unit there was only one value, −1. The overall accuracy of the network was 91.35%: all 49 metal-cylinder returns in the training data were correctly classified, and 9 rock returns were incorrectly classified as metal-cylinder returns.

By computing the predicted outputs of the network with discretized hidden-unit activation values, we found that as long as the activation value of the first hidden unit equals −0.95, a pattern is classified as a sonar return from a metal cylinder. Because ten inputs were connected to this hidden unit, a new network was formed to allow us to extract rules. It was sufficient to distinguish between activation values equal to −0.95 and those not equal to −0.95 (that is, 0.94 and −0.04); hence the number of output units in the new network was two.
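A minimal sketch (ours) of the thermometer coding for a single attribute; with the subintervals of Table 5, the eleven attributes together yield the 28 binary inputs mentioned above:

```python
def thermometer(value, cuts):
    """Thermometer-code a continuous value against its ascending cut points.
    For A35 (cuts=[0.1300]): value < 0.13 -> [0, 1]; value >= 0.13 -> [1, 1].
    An attribute with k subintervals (k - 1 cuts) yields k bits."""
    return [1 if value >= c else 0 for c in reversed(cuts)] + [1]
```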
Figure 3: Pruned network trained to distinguish between patterns having discretized activation values equal to −0.95 and those not equal to −0.95.
Samples with activation values equal to −0.95 were assigned target values of {1, 0}; all other patterns were assigned target values of {0, 1}. The number of hidden units was six. The pruned network is depicted in Figure 3. Note that this network's accuracy rate was 100%: it correctly distinguished all 58 training patterns having activation values equal to −0.95 from those having activation values 0.94 or −0.04. In order to preserve this accuracy rate, six and three discrete activation values were required at the two hidden units left in the pruned network. At the first hidden unit, the activation values found were −1.00, −0.54, −0.12, 0.46, −0.76, and 0.11; at the second hidden unit, the values were −1.00, 1.00, and 0.00. All 18 possible combinations of these values were given as input to X2R. The target for each combination was computed using the weights on the connections from the hidden units to the output units of the network. Let β(i, j) denote hidden unit i taking its jth activation value. The following rules that describe the classification of the sonar returns were generated by X2R:

• If β(1, 1), then output = Metal.
• Else if β(2, 2), then output = Metal.
• Else if β(1, 2) and β(2, 3), then output = Metal.
• Else if β(1, 3) and β(2, 3), then output = Metal.
• Else if β(1, 5), then output = Metal.
• Default output: Rock.
The activation values of the first hidden unit were determined by five inputs: I1, I8, I12, I25, and I26. Those of the second hidden unit were determined by only four inputs: I3, I11, I14, and I22. Four activation values at the first hidden unit (−1.00, −0.54, −0.12, and −0.76) and two values at the second hidden unit (1.00 and 0.00) were involved in the rules that classified a pattern as a return from a metal cylinder. The rules that determined these activation values were as follows:

• Hidden unit 1 (initially, β(1, 1) = β(1, 2) = β(1, 3) = β(1, 5) = false):
  — If (I25 = 0) and (I26 = 1), then β(1, 1).
  — If (I8 = 1) and (I25 = 1), then β(1, 2).
  — If (I8 = 0) and (I12 = 1) and (I26 = 0), then β(1, 2).
  — If (I8 = 0) and (I12 = 1) and (I25 = 1), then β(1, 3).
  — If (I1 = 0) and (I12 = 0) and (I26 = 0), then β(1, 3).
  — If (I8 = 1) and (I26 = 0), then β(1, 5).

• Hidden unit 2 (initially, β(2, 2) = β(2, 3) = false):
  — If (I3 = 0) and (I14 = 1), then β(2, 2).
  — If (I22 = 1), then β(2, 2).
  — If (I3 = 0) and (I11 = 0) and (I14 = 0) and (I22 = 0), then β(2, 3).

With the thermometer coding scheme used to encode the original continuous data, it is easy to obtain rules in terms of the original attributes and their values. The complete set of rules generated for the sonar-returns data set is given in Appendix B. A total of eight rules that classify a return as coming from a metal cylinder were generated. Two metal returns and three rock returns satisfied the conditions of one of these eight rules; with this rule removed, the remaining seven rules correctly classified 96 of the 104 training data (92.31%) and 101 of the 104 testing data (97.12%).

4 Discussion and Comparison with Related Work

We have shown how rules can be extracted from a trained neural network without making any assumptions about the network's activations or having initial knowledge about the problem domain. If some knowledge is available, however, it can always be incorporated into the network. For example, connections in the network from inputs thought not to be relevant can be given large penalty parameters during training, while those thought to be relevant can be given zero or small penalty parameters (Setiono 1997). Our algorithm does not require a thresholded activation function to force the activation values to be zero or one (Towell and Shavlik 1993; Fu 1991), nor does it require the weights of the connections to be restricted to a certain range (Blassig 1994).
Craven and Shavlik (1994) mentioned that one of the difficulties of search-based methods such as those of Saito and Nakano (1988) and Fu (1991) is the complexity of the rules: some rules are found deep in the search space, and these search methods, having exponential complexity, may not be effective. Our algorithm, in contrast, finds intermediate rules embedded in the hidden units by training and pruning new subnetworks. The topology of any subnetwork is always the same, a three-layer feedforward one, regardless of how deep these rules are in the search space. Each hidden unit, if necessary, is split into several output units of a new subnetwork. When more than one new network is created, each can be trained independently. As the activation values of a hidden unit are normally determined by a subset of the original inputs, we can substantially speed up the training of the subnetwork by using only the relevant inputs and removing duplicate patterns. The use of a standard three-layer network is another advantage of our method; any off-the-shelf training algorithm can be used.

The accuracy and the number of rules generated by our algorithm are better than those obtained by C4.5 (Quinlan 1993), a popular machine learning algorithm that extracts rules from decision trees. For the splice-junction problem, C4.5's accuracy rate on the testing data was 92.13%, and the number of rules generated was 43; our algorithm achieved 1% higher accuracy (93.32%) with far fewer rules (10). The MofN algorithm achieved a 92.8% average accuracy rate with 20 rules and more than 100 conditions per rule; these figures were obtained from 10 repetitions of tenfold cross-validation on a set of 1000 randomly selected training patterns (Towell and Shavlik 1993). Using 30% of the 3190 samples for training, 10% for cross-validation, and 60% for testing, gradient descent symbolic rule generation was reported to achieve an accuracy of 93.25%, with three intermediate rules for the hidden units and two rules for the two output units defining the classes EI and IE (Blassig 1994). For the sonar-return classification problem, the accuracy of the rules generated by our algorithm on the testing set was 2.88% higher than that of C4.5 (97.12% versus 94.24%), and the number of rules generated by RX was only seven, three fewer than C4.5's rules.

5 Conclusion

Neural networks are often viewed as black boxes: although their predictive accuracy is high, one usually cannot understand why a particular outcome is predicted. In this article, we have attempted to open up these black boxes. Two factors make this possible. The first is a robust pruning algorithm: when redundant weights are eliminated, redundant input and hidden units are identified and removed from the network, and their removal significantly simplifies both the process of rule extraction and the extracted rules themselves. The second factor is the clustering of the hidden-unit activation values. The fact that the number of distinct activation values at
the hidden units can be made small enough enables us to extract simple rules.

An important feature of our rule-extraction algorithm is its recursive nature. When, after pruning, a hidden unit is still connected to a relatively large number of inputs, a new network is formed in order to generate rules that describe its activation values, and the rule-extraction algorithm is applied to this new network. By merging the rules extracted from the new network with the rules extracted from the original network, hierarchical rules are generated.

The viability of the proposed method has been shown on two sets of real-world data. Compact sets of rules that achieve a high accuracy rate were extracted for these data sets.

Appendix A

The following rules were generated by the algorithm for the splice-junction domain (initial values for β(i, j) and α(i, j) are false):

• Level 3 (rules that define the activation values of the subnetwork depicted in Fig. 2):
  1. If (@4=A ∧ @5=C) ∨ (@4≠A ∧ @5=G), then β(1, 2), else if (@4=A ∧ @5=G), then β(1, 3), else β(1, 1).
  2. If (@1=G ∧ @2=T ∧ @3=A), then β(2, 1), else if (@1=G ∧ @2=T ∧ @3≠A), then β(2, 2), else β(2, 3).

• Level 2 (rules that define the activation values of the second hidden unit of the network depicted in Fig. 1):
  If β(2, 3) ∨ (β(1, 1) ∧ β(2, 2)), then α(2, 1), else if (β(1, 1) ∧ β(2, 1)) ∨ (β(1, 2) ∧ β(2, 2)), then α(2, 2), else α(2, 3).

• Level 1 (rules extracted from the pruned network depicted in Fig. 1):
  If (@-1=G ∧ α(2, 1) ∧ @-2=A), then IE, else if α(2, 3) ∨ (@-1=G ∧ α(2, 2)) ∨ (@-1≠G ∧ α(2, 2) ∧ @-2=A), then EI, else N.

On removal of the intermediate rules in levels 2 and 3, we obtain the following equivalent set of rules:

IE :- @-2 'AGH----'.
IE :- @-2 'AG-V---'.
IE :- @-2 'AGGTBAT'.
IE :- @-2 ’AGGTBAA’.
IE :- @-2 ’AGGTBBH’.
EI :- @-2 ’--GT-AG’.
EI :- @-2 ’--GTABG’.
EI :- @-2 ’--GTAAC’.
EI :- @-2 ’-GGTAAW’.
EI :- @-2 ’-GGTABH’.
EI :- @-2 ’-GGTBBG’. (∗)
EI :- @-2 ’-GGTBAC’. (∗)
EI :- @-2 ’AHGTAAW’. (∗)
EI :- @-2 ’AHGTABH’. (∗)
EI :- @-2 ’AHGTBBG’. (∗)
EI :- @-2 ’AHGTBAC’. (∗)
N.

Table A.1: Accuracy of the Extracted Rules on the Splice-Junction Training and Testing Data Sets.

                 Training data                      Testing data
Class    Errors (total)   Accuracy (%)     Errors (total)   Accuracy (%)
IE       17 (243)         93.00            48 (765)         93.73
EI       10 (228)         95.61            40 (762)         94.75
N        49 (535)         90.84            134 (1648)       91.50
Total    76 (1006)        92.45            222 (3175)       93.01
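Rules in this form are directly executable. The following sketch is our illustration rather than part of the algorithm’s original implementation: it assumes a seven-nucleotide window spanning positions −2 through 5 and applies pattern rules of the kind listed above, using the wildcard conventions explained in the next paragraph.

# A sketch (ours, with assumed rule sets) of how the extracted pattern rules
# could be applied. A window is the 7 nucleotides at positions -2, -1, 1, 2,
# 3, 4, 5 around a candidate splice junction, matching the reading of the
# 'AGH----' rule given in the text.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "B": "CGT", "W": "AT", "H": "ACT", "V": "ACG", "-": "ACGT",
}

def matches(pattern, window):
    """True if every window position satisfies the pattern's wildcard."""
    return all(base in CODES[p] for p, base in zip(pattern, window))

def classify(window, ie_rules, ei_rules):
    """Apply IE rules first, then EI rules, defaulting to N."""
    if any(matches(p, window) for p in ie_rules):
        return "IE"
    if any(matches(p, window) for p in ei_rules):
        return "EI"
    return "N"

# The first IE rule fires on A at -2, G at -1, and H (A, C, or T) at 1.
print(classify("AGTACGT", ["AGH----"], ["--GT-AG"]))  # prints IE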
An extended version of standard Prolog notation has been used to express the rules concisely. For example, the first rule indicates that an A in position −2, a G in position −1, and an H in position 1 will result in the classification of the boundary type as an intron-exon. Following convention, the letter B means C or G or T; W means A or T; H means A or C or T; and V means A or C or G. The character ‘-’ indicates any one of the four nucleotides A, C, G, or T. Each rule marked by (∗) classifies no more than three patterns out of the 1006 patterns in the training set. The accuracy rates in Table A.1 were obtained using the 10 unmarked rules for classification.

Appendix B

The following rules were generated by the algorithm for the sonar-returns classification problem (original attributes are assumed to have been labeled
Table B.1: Accuracy of the Extracted Rules on the Sonar-Returns Training and Testing Data Sets.

                 Training data                      Testing data
Class    Errors (total)   Accuracy (%)     Errors (total)   Accuracy (%)
Metal    2 (49)           95.92            0 (62)           100.00
Rock     6 (55)           89.09            3 (42)           92.86
Total    8 (104)          92.31            3 (104)          97.12
A1, A2, . . . , A60):

Metal :- A36 < 0.5070 and A48 >= 0.0762.
Metal :- A54 >= 0.0226.
Metal :- A45 >= 0.2810 and A55 < 0.0057.
Metal :- A55 < 0.0064 and A55 >= 0.0057.
Metal :- A36 < 0.5070 and A45 < 0.2810 and A47 < 0.0910 and A47 >= 0.0610 and A48 < 0.0762 and A54 < 0.0226 and A55 < 0.0057.
Metal :- A35 < 0.1300 and A36 < 0.5070 and A47 < 0.0610 and A48 < 0.0762 and A54 < 0.0226 and A55 < 0.0057.
Metal :- A36 < 0.5070 and A45 >= 0.2810 and A47 < 0.0910 and A48 < 0.0762 and A54 < 0.0226 and A55 >= 0.0064. (∗)
Metal :- A36 < 0.5070 and A45 < 0.2810 and A47 < 0.0910 and A47 >= 0.0610 and A48 < 0.0762 and A54 < 0.0226 and A55 >= 0.0064. (∗∗)
Rock.
The rule marked (∗) does not classify any pattern in the training set. The conditions of the rule marked (∗∗) are satisfied by two metal returns and three rock returns in the training set. The accuracy rates in Table B.1 were obtained using the six unmarked rules.

Acknowledgments

I thank my colleague Huan Liu for providing the results of C4.5 and the two anonymous reviewers for their comments and suggestions, which greatly improved the presentation of this article.
References

Blassig, R. 1994. GDS: Gradient descent generation of symbolic classification rules. In Advances in Neural Information Processing Systems, Vol. 6, pp. 1093–1100. Morgan Kaufmann, San Mateo, CA.
Craven, M. W., and Shavlik, J. W. 1994. Using sampling and queries to extract rules from trained neural networks. In Proc. of the Eleventh International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA.
de Villiers, J., and Barnard, E. 1992. Backpropagation neural nets with one and two hidden layers. IEEE Trans. on Neural Networks 4(1), 136–141.
Fu, L. 1991. Rule learning by searching on adapted nets. In Proc. of the Ninth National Conference on Artificial Intelligence, pp. 590–595. AAAI Press/MIT Press, Menlo Park, CA.
Gorman, R. P., and Sejnowski, T. J. 1988. Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks 1, 75–89.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward neural networks. Neural Networks 4, 251–257.
Kerber, R. 1992. Chi-merge: Discretization of numeric attributes. In Proc. of the Ninth National Conference on Artificial Intelligence, pp. 123–128. AAAI Press/MIT Press, Menlo Park, CA.
Lapedes, A., Barnes, C., Burks, C., Farber, R., and Sirotkin, K. 1989. Application of neural networks and other machine learning algorithms to DNA sequence analysis. In Computers and DNA, pp. 157–182. Addison-Wesley, Redwood City, CA.
Liu, H., and Tan, S. T. 1995. X2R: A fast rule generator. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, pp. 1631–1635. IEEE Press, New York.
Murphy, P. M., and Aha, D. W. 1992. UCI repository of machine learning databases. Machine-readable data repository, University of California, Irvine, CA.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Saito, K., and Nakano, R. 1988. Medical diagnosis expert system based on PDP model. In Proc. IEEE International Conference on Neural Networks, pp. 1255–1262. IEEE Press, New York.
Setiono, R. 1997. A penalty-function approach for pruning feedforward neural networks. Neural Computation 9, 185–204.
Smith, M. 1993. Neural Networks for Statistical Modeling. Van Nostrand Reinhold, New York.
Thrun, S. 1995. Extracting rules from artificial neural networks with distributed representations. In Advances in Neural Information Processing Systems, Vol. 7. MIT Press, Cambridge, MA.
Towell, G. G., and Shavlik, J. W. 1993. Extracting refined rules from knowledge-based neural networks. Machine Learning 13(1), 71–101.

Received November 23, 1994; accepted March 13, 1996.
REVIEW
Communicated by Geoffrey Hinton
Probabilistic Independence Networks for Hidden Markov Probability Models Padhraic Smyth Department of Information and Computer Science, University of California at Irvine, Irvine, CA 92697-3425 USA and Jet Propulsion Laboratory 525-3660, California Institute of Technology, Pasadena, CA 91109 USA
David Heckerman Microsoft Research, Redmond, WA 98052-6399 USA
Michael I. Jordan Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Graphical techniques for modeling the dependencies of random variables have been explored in a variety of different areas, including statistics, statistical physics, artificial intelligence, speech recognition, image processing, and genetics. Formalisms for manipulating these models have been developed relatively independently in these research communities. In this paper we explore hidden Markov models (HMMs) and related structures within the general framework of probabilistic independence networks (PINs). The paper presents a self-contained review of the basic principles of PINs. It is shown that the well-known forward-backward (F-B) and Viterbi algorithms for HMMs are special cases of more general inference algorithms for arbitrary PINs. Furthermore, the existence of inference and estimation algorithms for more general graphical models provides a set of analysis tools for HMM practitioners who wish to explore a richer class of HMM structures. Examples of relatively complex models to handle sensor fusion and coarticulation in speech recognition are introduced and treated within the graphical model framework to illustrate the advantages of the general approach. 1 Introduction For multivariate statistical modeling applications, such as hidden Markov modeling (HMM) for speech recognition, the identification and manipulation of relevant conditional independence assumptions can be useful for model building and analysis. There has recently been a considerable amount Neural Computation 9, 227–269 (1997)
© 1997 Massachusetts Institute of Technology
of work exploring the relationships between conditional independence in probability models and structural properties of related graphs. In particular, the separation properties of a graph can be directly related to conditional independence properties in a set of associated probability models. The key point of this article is that the analysis and manipulation of generalized HMMs (more complex HMMs than the standard first-order model) can be facilitated by exploiting the relationship between probability models and graphs. The major advantages to be gained are in two areas: • Model description. A graphical model provides a natural and intuitive medium for displaying dependencies that exist between random variables. In particular, the structure of the graphical model clarifies the conditional independencies in the associated probability models, allowing model assessment and revision. • Computational efficiency. The graphical model is a powerful basis for specifying efficient algorithms for computing quantities of interest in the probability model (e.g., calculation of the probability of observed data given the model). These inference algorithms can be specified automatically once the initial structure of the graph is determined. We will refer to both probability models and graphical models. Each consists of structure and parameters. The structure of the model consists of the specification of a set of conditional independence relations for the probability model or a set of (missing) edges in the graph for the graphical model. The parameters of both the probability and graphical models consist of the specification of the joint probability distribution: in factored form for the probability model and defined locally on the nodes of the graph for the graphical model. The inference problem is that of the calculation of posterior probabilities of variables of interest given observable data and a specification of the probabilistic model. The related task of maximum a posteriori (MAP) identification is the determination of the most likely state of a set of unobserved variables, given observed variables and the probabilistic model. The learning or estimation problem is that of determining the parameters (and possibly structure) of the probabilistic model from data. This article reviews the applicability and utility of graphical modeling to HMMs and various extensions of HMMs. Section 2 introduces the basic notation for probability models and associated graph structures. Section 3 summarizes relevant results from the literature on probabilistic independence networks (PINs), in particular, the relationships that exist between separation in a graph and conditional independence in a probability model. Section 4 interprets the standard first-order HMM in terms of PINs. In Section 5 the standard algorithm for inference in a directed PIN is discussed and applied to the standard HMM in Section 6. A result of interest is that the forward-backward (F-B) and Viterbi algorithms are shown to be special cases of this inference algorithm. Section 7 shows that the inference algo-
rithms for undirected PINs are essentially the same as those already discussed for directed PINs. Section 8 introduces more complex HMM structures for speech modeling and analyzes them using the graphical model framework. Section 9 reviews known estimation results for graphical models and discusses their potential implications for practical problems in the estimation of HMM structures, and Section 10 contains summary remarks.

2 Notation and Background

Let U = {X1 , X2 , . . . , XN } represent a set of discrete-valued random variables. For the purposes of this article we restrict our attention to discrete-valued random variables; however, many of the results stated generalize directly to continuous and mixed sets of random variables (Lauritzen and Wermuth 1989; Whittaker 1990). Let lowercase xi denote one of the values of variable Xi : the notation \sum_{x_1} is taken to mean the sum over all possible values of X1 . Let p(xi ) be shorthand for the particular probability p(Xi = xi ), whereas p(Xi ) represents the probability function for Xi (a table of values, since Xi is assumed discrete), 1 ≤ i ≤ N. The full joint distribution function is p(U) = p(X1 , X2 , . . . , XN ), and p(u) = p(x1 , x2 , . . . , xN ) denotes a particular value assignment for U. Note that this full joint distribution p(U) = p(X1 , X2 , . . . , XN ) provides all the possible information one needs to calculate any marginal or conditional probability of interest among subsets of U. If A, B, and C are disjoint sets of random variables, the conditional independence relation A ⊥ B|C is defined such that A is independent of B given C, that is, p(A, B|C) = p(A|C)p(B|C). Conditional independence is symmetric. Note also that marginal independence (no conditioning) does not in general imply conditional independence, nor does conditional independence in general imply marginal independence (Whittaker 1990). With any set of random variables U we can associate a graph G defined as G = (V, E). V denotes the set of vertices or nodes of the graph such that there is a one-to-one mapping between the nodes in the graph and the random variables, that is, V = {X1 , X2 , . . . , XN }. E denotes the set of edges, {e(i, j)}, where i and j are shorthand for the nodes Xi and Xj , 1 ≤ i, j ≤ N. Edges of the form e(i, i) are not of interest and thus are not allowed in the graphs discussed in this article. An edge may be directed or undirected. Our convention is that a directed edge e(i, j) is directed from node i to node j, in which case we sometimes say that i is a parent of its child j. An ancestor of node i is a node that has as a child either i or another ancestor of i. A subset of nodes A is an ancestral set if it contains its own ancestors. A descendant of i is either a child of i or a child of a descendant of i. Two nodes i and j are adjacent in G if E contains the undirected or directed edge e(i, j). An undirected path is a sequence of distinct nodes {1, . . . , m} such that there exists an undirected or directed edge for each pair of nodes
{l, l+1} on the path. A directed path is a sequence of distinct nodes {1, . . . , m} such that there exists a directed edge for each pair of nodes {l, l + 1} on the path. A graph is singly connected if there exists only one undirected path between any two nodes in the graph. An (un)directed cycle is a path such that the beginning and ending nodes on the (un)directed path are the same. If E contains only undirected edges, then the graph G is an undirected graph (UG). If E contains only directed edges then the graph G is a directed graph (DG). Two important classes of graphs for modeling probability distributions that we consider in this paper are UGs and acyclic directed graphs (ADGs)— directed graphs having no directed cycles. We note in passing that there exists a theory for graphical independence models involving both directed and undirected edges (chain graphs, Whittaker 1990), but these are not discussed here. For a UG G, a subset of nodes C separates two other subsets of nodes A and B if every path joining every pair of nodes i ∈ A and j ∈ B contains at least one node from C. For ADGs, analogous but somewhat more complicated separation properties exist. A graph G is complete if there are edges between all pairs of nodes. A cycle in an undirected graph is chordless if none other than successive pairs of nodes in the cycle are adjacent. An undirected graph G is triangulated if and only if the only chordless cycles in the graph contain no more than three nodes. Thus, if one can find a chordless cycle of length four or more, G is not triangulated. A clique in an undirected graph G is a subgraph of G that is complete. A clique tree for G is a tree of cliques such that there is a one-to-one correspondence between the cliques of G and the nodes of the tree. 3 Probabilistic Independence Networks We briefly review the relation between a probability model p(U) = p(X1 , . . . , XN ) and a probabilistic independence network structure G = (V, E) where the vertices V are in one-to-one correspondence with the random variables in U. (The results in this section are largely summarized versions of material in Pearl 1988 and Whittaker 1990.) A PIN structure G is a graphical statement of a set of conditional independence relations for a set of random variables U. Absence of an edge e(i, j) in G implies some independence relation between Xi and Xj . Thus, a PIN structure G is a particular way of specifying the independence relationships present in the probability model p(U). We say that G implies a set of probability models p(U), denoted as PG , that is, p(U) ∈ PG . In the reverse direction, a particular model p(U) embodies a particular set of conditional independence assumptions that may or may not be representable in a consistent graphical form. One can derive all of the conditional independence
Figure 1: An example of a UPIN structure G that captures a particular set of conditional independence relationships among the set of variables {X1 , . . . , X6 }—for example, X5 ⊥ {X1 , X2 , X4 , X6 } | {X3 }.
properties and inference algorithms of interest for U without reference to graphical models. However, as has been emphasized in the statistical and artificial intelligence literature, and as reiterated in this article in the context of HMMs, there are distinct advantages to be gained from using the graphical formalism. 3.1 Undirected Probabilistic Independence Networks (UPINs). A UPIN is composed of both a UPIN structure and UPIN parameters. A UPIN structure specifies a set of conditional independence relations for a probability model in the form of an undirected graph. UPIN parameters consist of numerical specifications of a particular probability model consistent with the UPIN structure. Terms used in the literature to describe UPINs of one form or another include Markov random fields (Isham 1981; Geman and Geman 1984), Markov networks (Pearl 1988), Boltzmann machines (Hinton and Sejnowski 1986), and log-linear models (Bishop et al. 1973). 3.1.1 Conditional Independence Semantics of UPIN Structures. Let A, B, and S be any disjoint subsets of nodes in an undirected graph G. G is a UPIN structure for p(U) if for any A, B, and S such that S separates A and B in G, the conditional independence relation A ⊥ B|S holds in p(U). The set of all conditional independence relations implied by separation in G constitutes the (global) Markov properties of G. Figure 1 shows a simple example of a UPIN structure for six variables. Thus, separation in the UPIN structure implies conditional independence in the probability model; i.e., it constrains p(U) to belong to a set of probability models PG that obey the Markov properties of the graph. Note that a complete UG is trivially a UPIN structure for any p(U) in the sense that
there are no constraints on p(U). G is a perfect undirected map for p if G is a UPIN structure for p(U) and all the conditional independence relations present in p(U) are represented by separation in G. For many probability models p there are no perfect undirected maps. A weaker condition is that a UPIN structure G is minimal for a probability model p(U) if the removal of any edge from G implies an independence relation not present in the model p(U); that is, the structure without the edge is no longer a UPIN structure for p(U). Minimality is not equivalent to perfection (for UPIN structures) since, for example, there exist probability models with independencies that cannot be represented as UPINs except for the complete UPIN structure. For example, consider that X and Y are marginally independent but conditionally dependent given Z (e.g., X and Y are two independent causal variables with a common effect Z). In this case the complete graph is the minimal UPIN structure for {X, Y, Z}, but it is not perfect because of the presence of an edge between X and Y. 3.1.2 Probability Functions on UPIN structures. Given a UPIN structure G, the joint probability distribution for U can be expressed as a simple factorization: p(u) = p(x1 , . . . , xN ) =
\prod_{C \in V_C} a_C(x_C), \qquad (3.1)
where VC is the set of cliques of G, xC represents an assignment of values to the variables in a particular clique C, and the aC (xC ) are non-negative clique functions. (The domain of each aC (xC ) is the set of possible assignments of values to the variables in the clique C, and the range of aC (xC ) is the semi-infinite interval [0, ∞).) The set of clique functions associated with a UPIN structure provides the numerical parameterization of the UPIN. A UPIN is equivalent to a Markov random field (Isham 1981). In the Markov random field literature the clique functions are generally referred to as potential functions. A related terminology, used in the context of the Boltzmann machine (Hinton and Sejnowski 1986), is that of energy function. The exponential of the negative energy of a configuration is a Boltzmann factor. Scaling each Boltzmann factor by the sum across Boltzmann factors (the partition function) yields a factorization of the joint density (the Boltzmann distribution), that is, a product of clique functions.¹ The advantage of defining clique functions directly rather than in terms of the exponential of an energy function is that the range of the clique functions can be allowed
to contain zero. Thus equation 3.1 can represent configurations of variables having zero probability.

¹ A Boltzmann machine is a special case of a UPIN in which the clique functions can be decomposed into products of factors associated with pairs of variables. If the Boltzmann machine is augmented to include “higher-order” energy terms, one for each clique in the graph, then we have a general Markov random field or UPIN, restricted to positive probability distributions due to the exponential form of the clique functions.

Figure 2: A triangulated version of the UPIN structure G from Figure 1.

A model p is said to be decomposable if it has a minimal UPIN structure G that is triangulated (see Fig. 2). A UPIN structure G is decomposable if G is triangulated. For the special case of decomposable models, G can be converted to a junction tree, which is a tree of cliques of G arranged such that the cliques satisfy the running intersection property, namely, that each node in G that appears in any two different cliques also appears in all cliques on the undirected path between these two cliques. Associated with each edge in the junction tree is a separator S, such that S contains the variables in the intersection of the two cliques that it links. Given a junction tree representation, one can factorize p(U) as the product of clique marginals over separator marginals (Pearl 1988):

p(u) = \frac{\prod_{C \in V_C} p(x_C)}{\prod_{S \in V_S} p(x_S)}, \qquad (3.2)
where p(xC ) and p(xS ) are the marginal (joint) distributions for the variables in clique C and separator S, respectively, and VC and VS are the set of cliques and separators in the junction tree. This product representation is central to the results in the rest of the article. It is the basis of the fact that globally consistent probability calculations on U can be carried out in a purely local manner. The mechanics of these local calculations will be described later. At this point it is sufficient to note that the complexity of the local inference algorithms scales as the sum of the sizes of the clique state-spaces (where a clique state-space is equal to the product over each variable in the clique of the number of states of each variable). Thus, local clique updating can make probability calculations on U much more tractable than using “brute force” inference if the model decomposes into relatively small cliques.
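As a concrete numerical illustration of equation 3.2 (a sketch with assumed numbers, not part of the original article), consider the decomposable chain X1 − X2 − X3 with cliques {X1 , X2 } and {X2 , X3 } and separator {X2 }:

import numpy as np

# For a decomposable chain X1 - X2 - X3, the joint equals the product of
# clique marginals divided by the separator marginal (equation 3.2).
rng = np.random.default_rng(0)

def random_cpt(shape):
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)  # rows sum to one

p1 = random_cpt((2,))        # p(x1)
p21 = random_cpt((2, 2))     # p(x2 | x1)
p32 = random_cpt((2, 2))     # p(x3 | x2)

# Full joint p(x1, x2, x3) = p(x1) p(x2|x1) p(x3|x2).
joint = p1[:, None, None] * p21[:, :, None] * p32[None, :, :]

p12 = joint.sum(axis=2)      # clique marginal p(x1, x2)
p23 = joint.sum(axis=0)      # clique marginal p(x2, x3)
p2 = joint.sum(axis=(0, 2))  # separator marginal p(x2)

reconstructed = p12[:, :, None] * p23[None, :, :] / p2[None, :, None]
assert np.allclose(joint, reconstructed)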
Many probability models of interest may not be decomposable. However, we can define a decomposable cover G0 for p such that G0 is a triangulated, but not necessarily minimal, UPIN structure for p. Since any UPIN G can be triangulated simply by the addition of the appropriate edges, one can always identify at least one decomposable cover G0 . However, a decomposable cover may not be minimal in that it can contain edges that obscure certain independencies in the model p; for example, the complete graph is a decomposable cover for all possible probability models p. For efficient inference, the goal is to find a decomposable cover G0 such that G0 contains as few extra edges as possible over the original UPIN structure G. Later we discuss a specific algorithm for finding decomposable covers for arbitrary PIN structures. All singly connected UPIN structures imply probability models PG that are decomposable. Note that given a particular probability model p and a UPIN G for p, the process of adding extra edges to G to create a decomposable cover does not change the underlying probability model p; that is, the added edges are a convenience for manipulating the graphical representation, but the underlying numerical probability specifications remain unchanged. An important point is that decomposable covers have the running intersection property and thus can be factored as in equation 3.2. Thus local clique updating is also possible with nondecomposable models by this conversion. Once again, the complexity of such local inference scales with the sum of the size of the clique state-spaces in the decomposable cover. In summary, any UPIN structure can be converted to a junction tree permitting inference calculations to be carried out purely locally on cliques. 3.2 Directed Probabilistic Independence Networks (DPINs). A DPIN is composed of both a DPIN structure and DPIN parameters. A DPIN structure specifies a set of conditional independence relations for a probability model in the form of a directed graph. DPIN parameters consist of numerical specifications of a particular probability model consistent with the DPIN structure. DPINs are referred to in the literature using different names, including Bayes network, belief network, recursive graphical model, causal (belief) network, and probabilistic (causal) network. 3.2.1 Conditional Independence Semantics of DPIN Structures. A DPIN structure is an ADG GD = (V, E) where there is a one-to-one correspondence between V and the elements of the set of random variables U = {X1 , . . . , XN }. It is convenient to define the moral graph GM of GD as the undirected graph obtained from GD by placing undirected edges between all nonadjacent parents of each node and then dropping the directions from the remaining directed edges (see Fig. 3b for an example). The term moral was coined to denote the “marrying” of “unmarried” (nonadjacent) parents. The motivation behind this procedure will become clear when we discuss the differ-
Figure 3: (a) A DPIN structure GD that captures a set of independence relationships among the set {X1 , . . . , X5 }—for example, X4 ⊥ X1 |X2 . (b) The moral graph GM for GD , where the parents of X4 have been linked.
ences between DPINs and UPINs in Section 3.3. We shall also see that this conversion of a DPIN into a UPIN is a convenient way to solve DPIN inference problems by “transforming” the problem into an undirected graphical setting and taking advantage of the general theory available for undirected graphical models. We can now define a DPIN as follows. Let A, B, and S be any disjoint subsets of nodes in GD . GD is a DPIN structure for p(U) if for any A, B, and S such that S separates A and B in GD , the conditional independence relation A ⊥ B|S holds in p(U). This is the same definition as for a UPIN structure except that separation has a more complex interpretation in the directed context: S separates A from B in a directed graph if S separates A from B in the moral (undirected) graph of the smallest ancestral set containing A, B, and S (Lauritzen et al. 1990). It can be shown that this definition of a DPIN structure is equivalent to the more intuitive statement that given the values of its parents, a variable Xi is independent of all other nodes in the directed graph except for its descendants.
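The moralization procedure is purely graph-theoretic and easy to state in code. The following sketch is our illustration (the five-node graph is invented, in the spirit of Figure 3): it marries all nonadjacent parents of each node and drops edge directions:

from itertools import combinations

# A minimal sketch (not from the paper) of moralization: marry all
# co-parents of each node, then drop the directions on the edges.
def moral_graph(parents):
    """parents: dict mapping node -> list of parent nodes (an ADG)."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                      # keep every edge, now undirected
            edges.add(frozenset((p, child)))
        for p, q in combinations(pa, 2):  # marry nonadjacent co-parents
            edges.add(frozenset((p, q)))
    return edges

# An assumed ADG in which X4 has two nonadjacent parents, X2 and X3.
gd = {"X2": ["X1"], "X3": ["X1"], "X4": ["X2", "X3"], "X5": ["X4"]}
gm = moral_graph(gd)
print(frozenset(("X2", "X3")) in gm)  # prints True: the parents were married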
Thus, as with a UPIN structure, the DPIN structure implies certain conditional independence relations, which in turn imply a set of probability models p ∈ PGD . Figure 3a contains a simple example of a DPIN structure. There are many possible DPIN structures consistent with a particular probability model p(U), potentially containing extra edges that hide true conditional independence relations. Thus, one can define minimal DPIN structures for p(U) in a manner exactly equivalent to that of UPIN structures: Deletion of an edge in a minimal DPIN structure GD implies an independence relation that does not hold in p(U) ∈ PGD . Similarly, GD is a perfect DPIN structure for p(U) if GD is a DPIN structure for p(U) and all the conditional independence relations present in p(U) are represented by separation in GD . As with UPIN structures, minimal does not imply perfect for DPIN structures. For example, consider the independence relations X1 ⊥ X4 |{X2 , X3 } and X2 ⊥ X3 |{X1 , X4 }: the minimal DPIN structure contains an edge from X3 to X2 (see Fig. 4b). A complete ADG is trivially a DPIN structure for any probability model p(U).

3.2.2 Probability Functions on DPINs. A basic property of a DPIN structure is that it implies a direct factorization of the joint probability distribution p(U):

p(u) = \prod_{i=1}^{N} p(x_i \mid \mathrm{pa}(x_i)), \qquad (3.3)
where pa(xi ) denotes a value assignment for the parents of Xi . A probability model p can be written in this factored form in a trivial manner by the conditioning rule. Note that a directed graph containing directed cycles does not necessarily yield such a factorization, hence the use of ADGs. 3.3 Differences between Directed and Undirected Graphical Representations. It is an important point that directed and undirected graphs possess different conditional independence semantics. There are common conditional independence relations that have perfect DPIN structures but no perfect UPIN structures, and vice versa (see Figure 4 for examples). Does a DPIN structure have the same Markov properties as the UPIN structure obtained by dropping all the directions on the edges in the DPIN structure? The answer is yes if and only if the DPIN structure contains no subgraphs where a node has two or more nonadjacent parents (Whittaker 1990; Pearl et al. 1990). In general, it can be shown that if a UPIN structure G for p is decomposable (triangulated), then it has the same Markov properties as some DPIN structure for p. On a more practical level, DPIN structures are frequently used to encode causal information, that is, to represent the belief formally that Xi precedes Xj in some causal sense (e.g., temporally). DPINs have found application in causal modeling in applied statistics and artificial intelligence. Their popularity in these fields stems from the fact that the joint probability model can
Figure 4: (a) The DPIN structure to encode the fact that X3 depends on X1 and X2 but X1 ⊥ X2 . For example, consider that X1 and X2 are two independent coin flips and that X3 is a bell that rings when the flips are the same. There is no perfect UPIN structure that can encode these dependence relationships. (b) A UPIN structure that encodes X1 ⊥ X4 |{X2 , X3 } and X2 ⊥ X3 |{X1 , X4 }. There is no perfect DPIN structure that can encode these dependencies.
be specified directly by equation 3.3, that is, by the specification of conditional probability tables or functions (Spiegelhalter et al. 1991). In contrast, UPINs must be specified in terms of clique functions (as in equation 3.1), which may not be as easy to work with (cf. Geman and Geman 1984, Modestino and Zhang 1992, and Vandermeulen et al. 1994 for examples of ad hoc design of clique functions in image analysis). UPINs are more frequently used in problems such as image analysis and statistical physics where associations are thought to be correlational rather than causal.

3.4 From DPINs to (Decomposable) UPINs. The moral UPIN structure GM (obtained from the DPIN structure GD ) does not imply any new independence relations that are not present in GD . As with triangulation, however, the additional edges may obscure conditional independence relations implicit in the numeric specification of the original probability model p associated with the DPIN structure GD . Furthermore, GM may not be triangulated (decomposable). By the addition of appropriate edges, the moral graph can be converted to a (nonunique) triangulated graph G0 , namely, a decomposable cover for GM . In this manner, for any probability model p for which GD is a DPIN structure, one can construct a decomposable cover G0 for p. This mapping from DPIN structures to UPIN structures was first discussed in the context of efficient inference algorithms by Lauritzen and
Spiegelhalter (1988). The advantage of this mapping derives from the fact that analysis and manipulation of the resulting UPIN are considerably more direct than dealing with the original DPIN. Furthermore, it has been shown that many of the inference algorithms for DPINs are in fact special cases of inference algorithms for UPINs and can be considerably less efficient (Shachter et al. 1994). 4 Modeling HMMs as PINs 4.1 PINs for HMMs. In hidden Markov modeling problems (Baum and Petrie 1966; Poritz 1988; Rabiner 1989; Huang et al. 1990; Elliott et al. 1995) we are interested in the set of random variables U = {H1 , O1 , H2 , O2 , . . . , HN−1 , ON−1 , HN , ON }, where Hi is a discrete-valued hidden variable at index i, and Oi is the corresponding discrete-valued observed variable at index i, 1 ≤ i ≤ N. (The results here can be directly extended to continuous-valued observables.) The index i denotes a sequence from 1 to N, for example, discrete time steps. Note that Oi is considered univariate for convenience: the extension to the multivariate case with d observables is straightforward but is omitted here for simplicity since it does not illuminate the conditional independence relationships in the HMM. The well-known simple first-order HMM obeys the following two conditional independence relations: Hi ⊥ {H1 , O1 , . . . , Hi−2 , Oi−2 , Oi−1 } | Hi−1 ,
3 ≤ i ≤ N (4.1)

and

Oi ⊥ {H1 , O1 , . . . , Hi−1 , Oi−1 } | Hi , 2 ≤ i ≤ N. (4.2)
We will refer to this “first-order” hidden Markov probability model as HMM(1,1): the notation HMM(K, J) is defined such that the hidden state of the model is represented via the conjoined configuration of J underlying random variables and such that the model has state memory of depth K. The notation will be clearer in later sections when we discuss specific examples with K, J > 1. Construction of a PIN for HMM(1,1) is simple. In the undirected case, assumption 1 requires that each state Hi is connected to Hi−1 only from the set {H1 , O1 , . . . , Hi−2 , Oi−2 , Hi−1 , Oi−1 }. Assumption 2 requires that Oi is connected only to Hi . The resulting UPIN structure for HMM(1,1) is shown in Figure 5a. This graph is singly connected and thus implies a decomposable probability model p for HMM(1,1), where the cliques are of the form {Hi , Oi } and {Hi−1 , Hi } (see Fig. 5b). In Section 5 we will see how the joint probability function can be expressed as a product function on the junction tree, thus leading to a junction tree definition of the familiar F-B and Viterbi inference algorithms.
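Written out via equation 3.3, the HMM(1,1) joint probability of a complete configuration is p(h1 )p(o1 |h1 ) ∏i p(hi |hi−1 )p(oi |hi ). The sketch below (parameter values assumed for illustration) evaluates this product directly:

import numpy as np

# A sketch of the HMM(1,1) probability model written in the DPIN factored
# form of equation 3.3; all numbers are assumed.
pi = np.array([0.6, 0.4])                 # p(h1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])    # p(h_i | h_{i-1})
B = np.array([[0.9, 0.1], [0.3, 0.7]])    # p(o_i | h_i)

def joint(h, o):
    """Probability of one complete hidden/observed configuration."""
    p = pi[h[0]] * B[h[0], o[0]]
    for t in range(1, len(h)):
        p *= A[h[t - 1], h[t]] * B[h[t], o[t]]
    return p

print(joint([0, 1, 1], [0, 0, 1]))  # one term of the full joint distribution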
Figure 5: (a) PIN structure for HMM(1,1). (b) A corresponding junction tree.
For the directed case, the connectivity for the DPIN structure is the same. It is natural to choose the directions on the edges between Hi−1 and Hi as going from i − 1 to i (although the reverse direction could also be chosen without changing the Markov properties of the graph). The directions on the edges between Hi and Oi must be chosen as going from Hi to Oi rather than in the reverse direction (see Figure 6a). In reverse (see Fig. 6b) the arrows would imply that Oi is marginally independent of Hi−1 , which is not true in the HMM(1,1) probability model. The proper direction for the edges implies the correct relation, namely, that Oi is conditionally independent of Hi−1 given Hi . The DPIN structure for HMM(1,1) does not possess a subgraph with nonadjacent parents. As stated earlier, this implies that the implied independence properties of the DPIN structure are the same as those of the corresponding UPIN structure obtained by dropping the directions from the edges in the DPIN structure, and thus they both result in the same junction tree structure (see Fig. 5b). Thus, for the HMM(1,1) probability model, the minimal directed and undirected graphs possess the same Markov properties; they imply the same conditional independence relations. Furthermore, both PIN structures are perfect maps for the directed and undirected cases, respectively.
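The claim about edge directions can also be checked numerically. In the sketch below (numbers assumed), the conditional distribution p(O2 |H1 ) varies with H1 , so O2 and H1 are marginally dependent, although O2 ⊥ H1 | H2 still holds:

import numpy as np

# A numeric check that O2 is *not* marginally independent of H1 in
# HMM(1,1), so the reversed arrows of Figure 6b encode a false independence.
A = np.array([[0.9, 0.1], [0.1, 0.9]])   # p(h2 | h1)
B = np.array([[0.8, 0.2], [0.4, 0.6]])   # p(o2 | h2)

p_o2_given_h1 = A @ B                    # sum over h2 of p(h2|h1) p(o2|h2)
print(p_o2_given_h1)
# The two rows differ, so p(O2 | H1) genuinely depends on H1.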
Figure 6: DPIN structures for HMM(1,1). (a) The DPIN structure for the HMM(1,1) probability model. (b) A DPIN structure that is not a DPIN structure for the HMM(1,1) probability model.
4.2 Inference and MAP Problems in HMMs. In the context of HMMs, the most common inference problem is the calculation of the likelihood of the observed evidence given the model, that is, p(o1 , . . . , oN |model), where the o1 , . . . , oN denote observed values for O1 , . . . , ON . (In this section we will assume that we are dealing with one particular model where the structure and parameters have already been determined, and thus we will not explicitly indicate conditioning on the model.) The “brute force” method for obtaining this probability would be to sum out the unobserved state variables from the full joint probability distribution:

p(o_1, \ldots, o_N) = \sum_{h_1, \ldots, h_N} p(h_1, o_1, \ldots, h_N, o_N), \qquad (4.3)

where hi denotes the possible values of hidden variable Hi . In general, both of these computations scale as m^N, where m is the number of states for each hidden variable. In practice, the F-B algorithm (Poritz 1988; Rabiner 1989) can perform these inference calculations with much lower complexity, namely, Nm^2 . The likelihood of the observed evidence
can be obtained with the forward step of the F-B algorithm: calculation of the state posterior probabilities requires both forward and backward steps. The F-B algorithm relies on a factorization of the joint probability function to obtain locally recursive methods. One of the key points in this article is that the graphical modeling approach provides an automatic method for determining such local efficient factorizations, for an arbitrary probabilistic model, if efficient factorizations exist given the conditional independence (CI) relations specified in the model. The MAP identification problem in the context of HMMs involves identifying the most likely hidden state sequence given the observed evidence. Just as with the inference problem, the Viterbi algorithm provides an efficient, locally recursive method for solving this problem with complexity Nm^2 , and again, as with the inference problem, the graphical modeling approach provides an automatic technique for determining efficient solutions to the MAP problem for arbitrary models, if an efficient solution is possible given the structure of the model.

5 Inference and MAP Algorithms for DPINs

Inference and MAP algorithms for DPINs and UPINs are quite similar: the UPIN case involves some subtleties not encountered in DPINs, and so discussion of UPIN inference and MAP algorithms is deferred until Section 7. The inference algorithm for DPINs (developed by Jensen et al. 1990, and hereafter referred to as the JLO algorithm) is a descendant of an inference algorithm first described by Lauritzen and Spiegelhalter (1988). The JLO algorithm applies to discrete-valued variables: extensions to the JLO algorithm for gaussian and gaussian-mixture distributions are discussed in Lauritzen and Wermuth (1989). A closely related algorithm to the JLO algorithm, developed by Dawid (1992a), solves the MAP identification problem with the same time complexity as the JLO inference algorithm. We show that the JLO and Dawid algorithms are strict generalizations of the well-known F-B and Viterbi algorithms for HMM(1,1) in that they can be applied to arbitrarily complex graph structures (and thus a large family of probabilistic models beyond HMM(1,1)) and handle missing values, partial inference, and so forth in a straightforward manner. There are many variations on the basic JLO and Dawid algorithms. For example, Pearl (1988) describes related versions of these algorithms in his early work. However, it can be shown (Shachter et al. 1994) that all known exact algorithms for inference on DPINs are equivalent at some level to the JLO and Dawid algorithms. Thus, it is sufficient to consider the JLO and Dawid algorithms in our discussion as they subsume other graphical inference algorithms.²

² An alternative set of computational formalisms is provided by the statistical physics literature, where undirected graphical models in the form of chains, trees, lattices, and “decorated” variations on chains and trees have been studied for many years (see, e.g., Itzykson and Drouffé 1991). The general methods developed there, notably the transfer matrix formalism (e.g., Morgenstern and Binder 1983), support exact calculations on general undirected graphs. The transfer matrix recursions and the calculations in the JLO algorithm are closely related, and a reasonable hypothesis is that they are equivalent formalisms. (The question does not appear to have been studied in the general case, although see Stolorz 1994 and Saul and Jordan 1995 for special cases.) The appeal of the JLO framework, in the context of this earlier literature on exact calculations, is the link that it provides to conditional probability models (i.e., directed graphs) and the focus on a particular data structure—the junction tree—as the generic data structure underlying exact calculations. This does not, of course, diminish the potential importance of statistical physics methodology in graphical modeling applications. One area where there is clearly much to be gained from links to statistical physics is the area of approximate calculations, where a wide variety of methods are available (see, e.g., Swendsen and Wang 1987).

The JLO and Dawid algorithms operate as a two-step process:

1. The construction step. This involves a series of substeps where the original directed graph is moralized and triangulated, a junction tree is formed, and the junction tree is initialized.

2. The propagation step. The junction tree is used in a local message-passing manner to propagate the effects of observed evidence, that is, to solve the inference and MAP problems.

The first step is carried out only once for a given graph. The second (propagation) step is carried out each time a new inference for the given graph is requested.

5.1 The Construction Step of the JLO Algorithm: From DPIN Structures to Junction Trees. We illustrate the construction step of the JLO algorithm using the simple DPIN structure, GD , over discrete variables U = {X1 , . . . , X6 } shown in Figure 7a. The JLO algorithm first constructs the moral graph GM (see Fig. 7b). It then triangulates the moral graph GM to obtain a decomposable cover G0 (see Fig. 7c). The algorithm operates in a simple, greedy manner based on the fact that a graph is triangulated if and only if all of its nodes can be eliminated, where a node can be eliminated whenever all of its neighbors are pairwise-linked. Whenever a node is eliminated, it and its neighbors define a clique in the junction tree that is eventually constructed. Thus, we can triangulate a graph and generate the cliques for the junction tree by eliminating nodes in some order, adding links if necessary. If no node can be eliminated without adding links, then we choose the node that can be eliminated by adding the links that yield the clique with the smallest state-space. After triangulation, the JLO algorithm constructs a junction tree from G0 (i.e., a clique tree satisfying the running intersection property). The junction tree construction is based on the following fact: Define the weight of a link
Figure 7: (a) A simple DPIN structure GD . (b) The corresponding (undirected) moral graph GM . (c) The corresponding triangulated graph G0 . (d) The corresponding junction tree.
between two cliques as the number of variables in their intersection. Then a tree of cliques will satisfy the running intersection property if and only if it is a spanning tree of maximal weight. Thus, the JLO algorithm constructs a junction tree by choosing successively a link of maximal weight unless it creates a cycle. The junction tree constructed from the cliques defined by the DPIN structure triangulation in Figure 7c is shown in Figure 7d. The worst-case complexity is O(N^3) for the triangulation heuristic and O(N^2 log N) for the maximal spanning tree portion of the algorithm. This construction step is carried out only once as an initial step to convert the original graph to a junction tree representation.

5.2 Initializing the Potential Functions in the Junction Tree. The next step is to take the numeric probability specifications as defined on the directed graph GD (see equation 3.3) and convert this information into the general form for a junction tree representation of p (see equation 3.2). This is achieved by noting that each variable Xi is contained in at least one clique in the junction tree. Assign each Xi to just one such clique, and for each clique
define the potential function a_C(x_C) to be either the product of p(Xi |pa(Xi )) over all Xi assigned to clique C, or 1 if no variables are assigned to that clique. Define the separator potentials (in equation 3.2) to be 1 initially. In the section that follows, we describe the general JLO algorithm for propagating messages through the junction tree to achieve globally consistent probability calculations. At this point it is sufficient to know that a schedule of local message passing can be defined that converges to a globally consistent marginal representation for p; that is, the potential on any clique or separator is the marginal for that clique or separator (the joint probability function). Thus, via local message passing, one can go from the initial potential representation defined above to a marginal representation:

p(u) = \frac{\prod_{C \in V_C} p(x_C)}{\prod_{S \in V_S} p(x_S)}. \qquad (5.1)
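For a small chain X1 → X2 → X3 with cliques {X1 , X2 } and {X2 , X3 }, the initialization just described can be sketched as follows (a toy example with assumed tables, not the JLO implementation itself):

import numpy as np

# Assign each conditional table to one clique, set separator potentials
# to 1; the resulting clique/separator product represents p(U).
p1 = np.array([0.5, 0.5])                     # p(x1)
p21 = np.array([[0.9, 0.1], [0.2, 0.8]])      # p(x2 | x1)
p32 = np.array([[0.6, 0.4], [0.3, 0.7]])      # p(x3 | x2)

a12 = p1[:, None] * p21        # clique {X1,X2} receives p(x1) and p(x2|x1)
a23 = p32.copy()               # clique {X2,X3} receives p(x3|x2)
b2 = np.ones(2)                # separator {X2} potential initialized to 1

# Representation check (cf. equation 5.2): product of cliques over separators.
joint = a12[:, :, None] * a23[None, :, :] / b2[None, :, None]
print(joint.sum())             # prints 1.0: a valid joint distribution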
At this point the junction tree is initialized. This operation in itself is not that useful; of more interest is the ability to propagate information through the graph given some observed data and the initialized junction tree (e.g., to calculate the posterior distributions of some variables of interest). From this point onward we will implicitly assume that the junction tree has been initialized as described so that the potential functions are the local marginals.

5.3 Local Message Propagation in Junction Trees Using the JLO Algorithm. In general p(U) can be expressed as

p(u) = \frac{\prod_{C \in V_C} a_C(x_C)}{\prod_{S \in V_S} b_S(x_S)}, \qquad (5.2)
where the aC and bS are nonnegative potential functions (the potential functions could be the initial marginals described above, for example). Note that this representation is a generalization of the representations for p(u) given by equations 3.1 and 3.2. K = ({aC : C ∈ VC }, {bS : S ∈ SC }) is a representation for p(U). A factorizable function p(U) can admit many different representations, that is, many different sets of clique and separator functions that satisfy equation 5.2 given a particular p(U). The JLO algorithm carries out globally consistent probability calculations via local message passing on the junction tree; probability information is passed between neighboring cliques, and clique and separator potentials are updated based on this local information. A key point is that the cliques and separators are updated in a fashion that ensures that at all times K is a representation for p(U); in other words, equation 5.2 holds at all times. Eventually the propagation converges to the marginal representation given the initial model and the observed evidence.
The message passing proceeds as follows: We can define a flow from clique Ci to Cj in the following manner, where Ci and Cj are two cliques adjacent in the junction tree. Let Sk be the separator for these two cliques. Define

b^*_{S_k}(x_{S_k}) = \sum_{C_i \setminus S_k} a_{C_i}(x_{C_i}) \qquad (5.3)

where the summation is over the state-space of variables that are in Ci but not in Sk , and

a^*_{C_j}(x_{C_j}) = a_{C_j}(x_{C_j}) \lambda_{S_k}(x_{S_k}) \qquad (5.4)

where

\lambda_{S_k}(x_{S_k}) = \frac{b^*_{S_k}(x_{S_k})}{b_{S_k}(x_{S_k})}. \qquad (5.5)
λSk (xSk ) is the update factor. Passage of a flow corresponds to updating the neighboring clique with the probability information contained in the originating clique. This flow induces a new representation K∗ = ({a∗C : C ∈ VC }, {b∗S : S ∈ SC }) for p(U). A schedule of such flows can be defined such that all cliques are eventually updated with all relevant information and the junction tree reaches an equilibrium state. The most direct scheduling scheme is a two-phase operation where one node is denoted the root of the junction tree. The collection phase involves passing flows along all edges toward the root clique (if a node is scheduled to have more than one incoming flow, the flows are absorbed sequentially). Once collection is complete, the distribution phase involves passing flows out from this root in the reverse direction along the same edges. There are at most two flows along any edge in the tree in a nonredundant schedule. Note that the directionality of the flows in the junction tree need have nothing to do with any directed edges in the original DPIN structure.

5.4 The JLO Algorithm for Inference Given Observed Evidence. The particular case of calculating the effect of observed evidence (inference) is handled in the following manner: Consider that we observe evidence of the form e = {X_i = x_i^*, X_j = x_j^*, . . .}, and U^e = {X_i , X_j , . . .} denotes the set of variables observed. Let U^h = U \ U^e denote the set of hidden or unobserved variables and u^h a value assignment for U^h . Consider the calculation of p(U^h |e). Define an evidence function g^e(x_i) such that

g^e(x_i) = \begin{cases} 1 & \text{if } x_i = x_i^* \\ 0 & \text{otherwise.} \end{cases} \qquad (5.6)
Let

f^*(u) = p(u) \prod_{X_i \in U^e} g^e(x_i). \qquad (5.7)
Thus, we have that f^*(u) ∝ p(u^h |e). To obtain f^*(u) by operations on the junction tree, one proceeds as follows: First assign each observed variable Xi ∈ U^e to one particular clique that contains it (this is termed “entering the evidence into the clique”). Let C^E denote the set of all cliques into which evidence is entered in this manner. For each C ∈ C^E let

g_C(x_C) = \prod_{\{i: X_i \text{ is entered into } C\}} g^e(x_i). \qquad (5.8)

Thus,

f^*(u) = p(u) \times \prod_{C \in C^E} g_C(x_C). \qquad (5.9)
One can now propagate the effects of these modifications throughout the tree using the collect-and-distribute schedule described in Section 5.3. Let x_C^h denote a value assignment of the hidden (unobserved) variables in clique C. When the schedule of flows is complete, one gets a new representation K_f^* such that the local potential on each clique is f^*(x_C) = p(x_C^h, e), that is, the joint probability of the local unobserved clique variables and the observed evidence (Jensen et al. 1990) (similarly for the separator potential functions). If one marginalizes at the clique over the unobserved local clique variables,

\sum_{x_C^h} p(x_C^h, e) = p(e), \qquad (5.10)
one gets the probability of the observed evidence directly. Similarly, if one normalizes the potential function at a clique to sum to one, one obtains the conditional probability of the local unobserved clique variables given the evidence, p(x_C^h |e).

5.5 Complexity of the Propagation Step of the JLO Algorithm. In general, the time complexity T of propagation within a junction tree is O(\sum_{i=1}^{N_C} s(C_i)), where N_C is the number of cliques in the junction tree and s(C_i) is the number of states in the clique state-space of C_i . Thus, for inference to be efficient, we need to construct junction trees with small clique sizes. Problems of finding optimally small junction trees (e.g., finding the junction tree with the smallest maximal clique) are NP-hard. Nonetheless, the heuristic algorithm for triangulation described earlier has been found to work well in practice (Jensen et al. 1990).
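To make the propagation concrete, the following sketch (our illustration, with assumed parameters) runs the collect-and-distribute schedule of Section 5.3 on the HMM(1,1) junction tree of Figure 5b, implementing the flows of equations 5.3 through 5.5 on the chain cliques {Ht , Ht+1 }. After propagation, every clique potential sums to the same value, the likelihood p(e), and the total cost is O(Nm^2):

import numpy as np

# Chain cliques C_t = {H_t, H_{t+1}}; the evidence from the {H_t, O_t}
# cliques has already been multiplied into the chain potentials.
m, N = 3, 6
rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(m))                 # p(h_1)
A = rng.dirichlet(np.ones(m), size=m)          # p(h_t | h_{t-1})
B = rng.dirichlet(np.ones(2), size=m)          # p(o_t | h_t), binary obs
obs = rng.integers(0, 2, size=N)

# Initial clique potentials a[t](h_t, h_{t+1}) and unit separator potentials.
a = [A * B[None, :, obs[t + 1]] for t in range(N - 1)]
a[0] *= (pi * B[:, obs[0]])[:, None]
sep = [np.ones(m) for _ in range(N - 2)]

def flow_right(t):
    """Flow C_t -> C_{t+1} through the separator over h_{t+1} (eqs. 5.3-5.5)."""
    b_new = a[t].sum(axis=0)                   # eq. 5.3: marginalize onto separator
    a[t + 1] *= (b_new / sep[t])[:, None]      # eqs. 5.4-5.5: apply update factor
    sep[t] = b_new

def flow_left(t):
    """Flow C_{t+1} -> C_t through the same separator."""
    b_new = a[t + 1].sum(axis=1)
    a[t] *= (b_new / sep[t])[None, :]
    sep[t] = b_new

for t in range(N - 2):            # collection toward the last clique
    flow_right(t)
for t in reversed(range(N - 2)):  # distribution back along the chain
    flow_left(t)

# Each clique now holds p(h_t, h_{t+1}, e); each sums to the likelihood p(e).
print([round(float(x.sum()), 12) for x in a])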
6 Inference and MAP Calculations in HMM(1,1)

6.1 The F-B Algorithm for HMM(1,1) Is a Special Case of the JLO Algorithm. Figure 5b shows the junction tree for HMM(1,1). One can apply the JLO algorithm to the HMM(1,1) junction tree structure to obtain a particular inference algorithm for HMM(1,1). The HMM(1,1) inference problem consists of being given a set of values for the observable variables,

e = \{O_1 = o_1, O_2 = o_2, \ldots, O_N = o_N\} \qquad (6.1)
and inferring the likelihood of e given the model. As described in the previous section, this problem can be solved exactly by local propagation in any junction tree using the JLO inference algorithm. In Appendix A it is shown that both the forward and backward steps of the F-B procedure for HMM(1,1) are exactly recreated by the more general JLO algorithm when the HMM(1,1) is viewed as a PIN. This equivalence is not surprising since both algorithms are solving exactly the same problem by local recursive updating. The equivalence is useful because it provides a link between well-known HMM inference algorithms and more general PIN inference algorithms. Furthermore, it clearly demonstrates how the PIN framework can provide a direct avenue for analyzing and using more complex hidden Markov probability models (we will discuss such HMMs in Section 8). When evidence is entered into the observable states and assuming m discrete states per hidden variable, the computational complexity of solving the inference problem via the JLO algorithm is O(Nm^2) (the same complexity as the standard F-B procedure). Note that the obvious structural equivalence between PIN structures and HMM(1,1) has been noted before by Buntine (1994), Frasconi and Bengio (1994), and Lucke (1995) among others; however, this is the first publication of equivalence of specific inference algorithms as far as we are aware.

6.2 Equivalence of Dawid’s Propagation Algorithm for Identifying MAP Assignments and the Viterbi Algorithm. Consider that one wishes to calculate \hat{f}(u^h, e) = \max_{x_1, \ldots, x_K} p(x_1, \ldots, x_K, e) and also to identify a set of values of the unobserved variables that achieve this maximum, where K is the number of unobserved (hidden) variables. This calculation can be achieved using a local propagation algorithm on the junction tree with two modifications to the standard JLO inference algorithm. This algorithm is due to Dawid (1992a); this is the most general algorithm from a set of related methods. First, during a flow, the marginalization of the separator is replaced by

\hat{b}_S(x_S) = \max_{C \setminus S} a_C(x_C), \qquad (6.2)
where C is the originating clique for the flow. The definition for λS (xS ) is also changed in the obvious manner. Second, marginalization within a clique is replaced by maximization:

\hat{f}_C = \max_{u \setminus x_C} p(u). \qquad (6.3)
Given these two changes, it can be shown that if the same propagation operations are carried out as described earlier, the resulting representation \hat{K}_f at equilibrium is such that the potential function on each clique C is

\hat{f}(x_C) = \max_{u^h \setminus x_C} p(x_C^h, e, \{u^h \setminus x_C\}), \qquad (6.4)
where x_C^h denotes a value assignment of the hidden (unobserved) variables in clique C. Thus, once the \hat{K}_f representation is obtained, one can locally identify the values of X_C^h , which maximize the full joint probability, as

\hat{x}_C^h = \arg_{x_C^h} \hat{f}(x_C). \qquad (6.5)
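Specialized to the HMM(1,1) chain, the two modifications above reduce to the familiar Viterbi recursion. The sketch below (assumed parameters, our illustration) records the maximizing arguments during the max-flows and reads off the maximizing configuration by backtracking:

import numpy as np

# Max-propagation on the HMM(1,1) chain: replacing the sum in eq. 5.3
# with a max (eq. 6.2) yields the Viterbi recursion.
m, N = 3, 6
rng = np.random.default_rng(2)
pi = rng.dirichlet(np.ones(m))
A = rng.dirichlet(np.ones(m), size=m)
B = rng.dirichlet(np.ones(2), size=m)
obs = rng.integers(0, 2, size=N)

delta = pi * B[:, obs[0]]                     # max-potential over h_1
back = np.zeros((N, m), dtype=int)
for t in range(1, N):
    scores = delta[:, None] * A * B[None, :, obs[t]]
    back[t] = scores.argmax(axis=0)           # best predecessor of each state
    delta = scores.max(axis=0)                # max replaces marginalization

h = [int(delta.argmax())]
for t in range(N - 1, 0, -1):                 # backtrack to decode the states
    h.append(int(back[t][h[-1]]))
print(list(reversed(h)), float(delta.max()))  # MAP states and max p(h, e)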
In the probabilistic expert systems literature, this procedure is known as generating the most probable explanation (MPE) given the observed evidence (Pearl 1988). The HMM(1,1) MAP problem consists of being given a set of values for the observable variables, e = {O1 = o1 , O2 = o2 , . . . , ON = oN }, and inferring

\max_{h_1, \ldots, h_N} p(h_1, \ldots, h_N, e) \qquad (6.6)
or the set of arguments that achieve this maximum. Since Dawid’s algorithm is applicable to any junction tree, it can be directly applied to the HMM(1,1) junction tree in Figure 5b. In Appendix B it is shown that Dawid’s algorithm, when applied to HMM(1,1), is exactly equivalent to the standard Viterbi algorithm. Once again the equivalence is not surprising: Dawid’s method and the Viterbi algorithm are both direct applications of dynamic programming to the MAP problem. However, once again, the important point is that Dawid’s algorithm is specified for the general case of arbitrary PIN structures and can thus be directly applied to more complex HMMs than HMM(1,1) (such as those discussed later in Section 8). 7 Inference and MAP Algorithms for UPINs In Section 5 we described the JLO algorithm for local inference given a DPIN. For UPINs the procedure is very similar except for two changes to
7 Inference and MAP Algorithms for UPINs

In Section 5 we described the JLO algorithm for local inference given a DPIN. For UPINs the procedure is very similar except for two changes to the overall algorithm: the moralization step is not necessary, and initialization of the junction tree is less trivial.

In Section 5.2 we described how to go from a specification of conditional probabilities in a directed graph to an initial potential function representation on the cliques in the junction tree. To utilize undirected links in the model specification process requires new machinery to perform the initialization step. In particular, we wish to compile the model into the standard form of a product of potentials on the cliques of a triangulated graph (cf. equation 3.1):

P(u) = \prod_{C \in V_C} a_C(x_C) .
Once this initialization step has been achieved, the JLO propagation procedure proceeds as before.

Consider the chordless cycle shown in Figure 4b. Suppose that we parameterize the probability distribution on this graph by specifying pairwise marginals on the four pairs of neighboring nodes. We wish to convert such a local specification into a globally consistent joint probability distribution, that is, a marginal representation. An algorithm known as iterative proportional fitting (IPF) is available to perform this conversion. Classically, IPF proceeds as follows (Bishop et al. 1973). Suppose for simplicity that all of the random variables are discrete (a gaussian version of IPF is also available; Whittaker 1990), so that the joint distribution can be represented as a table. The table is initialized with equal values in all of the cells. For each marginal in turn, the table is then rescaled by multiplying every cell by the ratio of the desired marginal to the corresponding marginal in the current table. The algorithm visits each marginal in turn, iterating over the set of marginals. If the set of marginals is consistent with a single joint distribution, the algorithm is guaranteed to converge to that joint distribution. Once the joint is available, the potentials in equation 3.1 can be obtained (in principle) by marginalization.

Although IPF solves the initialization problem in principle, it is inefficient. Jiřousek and Přeučil (1995) developed an efficient version of IPF that avoids the need both for storing the joint distribution as a table and for explicit marginalization of the joint to obtain the clique potentials. Jiřousek's version of IPF represents the evolving joint distribution directly in terms of junction tree potentials. The algorithm proceeds as follows: Let \mathcal{I} be a set of subsets of V. For each I \in \mathcal{I}, let q(x_I) denote the desired marginal on the subset I. Let the joint distribution be represented as a product over junction tree potentials (see equation 3.1), where each a_C is initialized to an arbitrary constant. Visit each I \in \mathcal{I} in turn, updating the corresponding clique potential a_C (i.e., that potential a_C for which I \subseteq C) as follows:

a_C^*(x_C) = a_C(x_C) \frac{q(x_I)}{p(x_I)} .
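As a concrete illustration of the classical table-based procedure described above (not of the Jiřousek and Přeučil junction tree version), here is a minimal sketch. The variable pairs, table shape, and sweep count are assumptions for the example, and the desired marginals are assumed strictly positive and mutually consistent.

```python
import numpy as np

def ipf(q, pairs, shape=(2, 2, 2, 2), sweeps=50):
    """q[k] is the desired marginal table for the variable pair pairs[k]
    (indices with i < j); marginals are assumed positive and consistent."""
    p = np.full(shape, 1.0 / np.prod(shape))      # equal values in all cells
    for _ in range(sweeps):
        for (i, j), q_ij in zip(pairs, q):
            axes = tuple(k for k in range(p.ndim) if k not in (i, j))
            m = p.sum(axis=axes)                  # current marginal on (i, j)
            idx = tuple(slice(None) if k in (i, j) else None
                        for k in range(p.ndim))
            p = p * (q_ij / m)[idx]               # rescale every cell
    return p

# Example: the four-cycle of Figure 4b with uniform pairwise marginals.
joint = ipf([np.full((2, 2), 0.25)] * 4, [(0, 1), (1, 2), (2, 3), (0, 3)])
```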
The marginal p(x_I) is obtained via the JLO algorithm, using the current set of clique potentials. Intelligent choices can be made for the order in which to visit the marginals so as to minimize the amount of propagation needed to compute p(x_I). This algorithm is simply an efficient way of organizing the IPF calculations and inherits the latter's guarantees of convergence. Note that the Jiřousek and Přeučil algorithm requires a triangulation step in order to form the junction tree used in the calculation of p(x_I). In the worst case, triangulation can yield a highly connected graph, in which case the Jiřousek and Přeučil algorithm reduces to classical IPF. For sparse graphs, however, when the maximum clique is much smaller than the entire graph, the algorithm should be substantially more efficient than classical IPF. Moreover, the triangulation algorithm itself need only be run once, as a preprocessing step (as is the case for the JLO algorithm).

8 More Complex HMMs for Speech Modeling

Although HMMs have provided an exceedingly useful framework for the modeling of speech signals, it is also true that the simple HMM(1,1) model underlying the standard framework has strong limitations as a model of speech. Real speech is generated by a set of coupled dynamical systems (lips, tongue, glottis, lungs, air columns, etc.), each obeying particular dynamical laws. This coupled physical process is not well modeled by the unstructured state transition matrix of HMM(1,1). Moreover, the first-order Markov properties of HMM(1,1) are not well suited to modeling the ubiquitous coarticulation effects that occur in speech, particularly coarticulatory effects that extend across several phonemes (Kent and Minifie 1977). A variety of techniques have been developed to surmount these basic weaknesses of the HMM(1,1) model, including mixture modeling of emission probabilities, triphone modeling, and discriminative training. All of these methods, however, leave intact the basic probabilistic structure of HMM(1,1) as expressed by its PIN structure.

In this section we describe several extensions of HMM(1,1) that assume additional probabilistic structure beyond that assumed by HMM(1,1). PINs provide a key tool in the study of these more complex models. The role of PINs is twofold: they provide a concise description of the probabilistic dependencies assumed by a particular model, and they provide a general algorithm for computing likelihoods. This second property is particularly important because the existence of the JLO algorithm frees us from having to derive particular recursive algorithms on a case-by-case basis.

The first model that we consider can be viewed as a coupling of two HMM(1,1) chains (Saul and Jordan 1995). Such a model can be useful in general sensor fusion problems, for example, in the fusion of an audio signal with a video signal in lipreading. Because different sensory signals generally have different bandwidths, it may be useful to couple separate Markov models that are developed specifically for each of the individual signals. The
alternative is to force the problem into an HMM(1,1) framework by either oversampling the slower signal, which requires additional parameters and leads to a high-variance estimator, or downsampling the faster signal, which generally oversmoothes the data and yields a biased estimator.

Consider the HMM(1,2) structure shown in Figure 8a. This model involves two HMM(1,1) backbones that are coupled together by undirected links between the state variables. Let H_i^{(1)} and O_i^{(1)} denote the ith state and ith output of the "fast" chain, respectively, and let H_i^{(2)} and O_i^{(2)} denote the ith state and ith output of the "slow" chain. Suppose that the fast chain is sampled \tau times as often as the slow chain. Then H_{i'}^{(1)} is connected to H_i^{(2)} for i' equal to \tau(i - 1) + 1. Given this value for i', the Markov model for the coupled chain implies the following conditional independencies for the state variables:

\{H_{i'}^{(1)}, H_i^{(2)}\} \perp \{H_1^{(1)}, O_1^{(1)}, H_1^{(2)}, O_1^{(2)}, \ldots, H_{i'-2}^{(1)}, O_{i'-2}^{(1)}, H_{i-2}^{(2)}, O_{i-2}^{(2)}, O_{i'-1}^{(1)}, O_{i-1}^{(2)}\} \mid \{H_{i'-1}^{(1)}, H_{i-1}^{(2)}\} ,   (8.1)
as well as the following conditional independencies for the output variables:

\{O_{i'}^{(1)}, O_i^{(2)}\} \perp \{H_1^{(1)}, O_1^{(1)}, H_1^{(2)}, O_1^{(2)}, \ldots, H_{i'-1}^{(1)}, O_{i'-1}^{(1)}, H_{i-1}^{(2)}, O_{i-1}^{(2)}\} \mid \{H_{i'}^{(1)}, H_i^{(2)}\} .   (8.2)
Additional conditional independencies can be read off the UPIN structure (see Figure 8a).

As is readily seen in Figure 8a, the HMM(1,2) graph is not triangulated; thus, the HMM(1,2) probability model is not decomposable. However, the graph can readily be triangulated to form a decomposable cover for the HMM(1,2) probability model (see Section 3.1.2). The JLO algorithm provides an efficient algorithm for calculating likelihoods in this graph. This can be seen in Figure 8b, where we show a triangulation of the HMM(1,2) graph. The triangulation adds O(N_h) links to the graph (where N_h is the number of hidden nodes in the graph) and creates a junction tree in which each clique is a cluster of three state variables from the underlying UPIN structure. Assuming m values for each state variable in each chain, we obtain an algorithm whose time complexity is O(N_h m^3). This can be compared to the naive approach of transforming the HMM(1,2) model to a Cartesian product HMM(1,1) model, which not only has the disadvantage of requiring subsampling or oversampling but also has a time complexity of O(N_h m^4).

Directed graph semantics can also play an important role in constructing interesting variations on the HMM theme. Consider Figure 9a, which shows an HMM(1,2) model in which a single output stream is coupled to a pair of underlying state sequences. In a speech modeling application, such a structure might be used to capture the fact that a given acoustic pattern can have multiple underlying articulatory causes. For example, equivalent shifts in
Figure 8: (a) The UPIN structure for the HMM(1,2) model with τ = 2. (b) A triangulation of this UPIN structure.
formant frequencies can be caused by lip rounding or tongue raising; such phenomena are generically referred to as "trading relations" in the speech psychophysics literature (Lindblom 1990; Perkell et al. 1993). Once a particular acoustic pattern is observed, the causes become dependent; thus, for example, evidence that the lips are rounded would act to discount inferences that the tongue has been raised. These inferences propagate forward and backward in time and couple the chains. Formally, these induced dependencies are accounted for by the links added between the state sequences during the moralization of the graph (see Figure 9b). This figure shows that the underlying calculations for this model are closely related to those of the earlier HMM(1,2), but the model specification is very different in the two cases.

Saul and Jordan (1996) have proposed a second extension of the HMM(1,1) model that is motivated by the desire to provide a more effective model of coarticulation (see also Stolorz 1994). In this model, shown in Figure 10, coarticulatory influences are modeled by additional links between output variables and states along an HMM(1,1) backbone. One approach to performing calculations in this model is to treat it as a Kth-order Markov chain and transform it into an HMM(1,1) model by defining higher-order state variables. A graphical modeling approach is more flexible. It is possible, for example, to introduce links between states and outputs K time steps apart without introducing links for the intervening time intervals. More generally, the graphical modeling approach to the HMM(K,1) model allows the specification of different interaction matrices at different time scales; this is awkward in the Kth-order Markov chain formalism. The HMM(3,1) graph is triangulated as is, and thus the time complexity of the JLO algorithm is O(N_h m^3). In general an HMM(K,1) graph creates cliques of size O(m^K), and the JLO algorithm runs in time O(N_h m^K).

As these examples suggest, the graphical modeling framework provides a useful framework for exploring extensions of HMMs. The examples also make clear, however, that the graphical algorithms are no panacea. The m^K complexity of HMM(K,1) will be prohibitive for large K. Also, the generalization of HMM(1,2) to HMM(1,K) (couplings of K chains) is intractable. Recent research has therefore focused on approximate algorithms for inference in such structures; see Saul and Jordan (1996) for HMM(K,1) and Ghahramani and Jordan (1996) and Williams and Hinton (1990) for HMM(1,K). These authors have developed an approximation methodology based on mean-field theory from statistical physics. While discussion of mean-field algorithms is beyond the scope of this article, it is worth noting that the graphical modeling framework plays a useful role in the development of these approximations. Essentially, the mean-field approach involves creating a simplified graph for which tractable algorithms are available and minimizing a probabilistic distance between the tractable graph and the intractable graph. The JLO algorithm is called as a subroutine on the tractable graph during the minimization process.
Figure 9: (a) The DPIN structure for HMM(1,2) with a single observable sequence coupled to a pair of underlying state sequences. (b) The moralization of this DPIN structure.
Figure 10: The UPIN structure for HMM(3,1).

9 Learning and PINs

Until now, we have assumed that the parameters and structure of a PIN are known with certainty. In this section, we drop this assumption and discuss methods for learning about the parameters and structure of a PIN.

The basic idea behind the techniques that we discuss is that there is a true joint probability distribution described by some PIN structure and parameters, but we are uncertain about this structure and its parameters. We
are unable to observe the true joint distribution directly, but we are able to observe a set of patterns u_1, \ldots, u_M that is a random sample from this true distribution. These patterns are independent and identically distributed (i.i.d.) according to the true distribution (note that in a typical HMM learning problem, each of the u_i consists of a sequence of observed data). We use these data to learn about the structure and parameters that encode the true distribution.

9.1 Parameter Estimation for PINs. First, let us consider the situation where we know the PIN structure S of the true distribution with certainty but are uncertain about the parameters of S. In keeping with the rest of the article, let us assume that all variables in U are discrete. Furthermore, for purposes of illustration, let us assume that S is an ADG.

Let x_i^k and pa(X_i)^j denote the kth value of variable X_i and the jth configuration of the variables pa(X_i) in S, respectively (j = 1, \ldots, q_i; k = 1, \ldots, r_i). As we have just discussed, we assume that each conditional probability p(x_i^k | pa(X_i)^j) is an uncertain parameter, and for convenience we represent this parameter as \theta_{ijk}. We use \theta_{ij} to denote the vector of parameters (\theta_{ij1}, \ldots, \theta_{ijr_i}) and \theta_s to denote the vector of all parameters for S. Note that \sum_{k=1}^{r_i} \theta_{ijk} = 1 for every i and j.

One method for learning about the parameters \theta_s is the Bayesian approach. We treat the parameters \theta_s as random variables, assign these parameters a prior distribution p(\theta_s | S), and update this prior distribution with data D = (u_1, \ldots, u_M) according to Bayes' rule:

p(\theta_s | D, S) = c \cdot p(\theta_s | S) \, p(D | \theta_s, S) ,   (9.1)
where c is a normalization constant that does not depend on \theta_s. Because the patterns in D are a random sample, equation 9.1 simplifies to

p(\theta_s | D, S) = c \cdot p(\theta_s | S) \prod_{l=1}^{M} p(u_l | \theta_s, S) .   (9.2)
Figure 11: A Bayesian network structure for a two-binary-variable domain {X1 , X2 } showing (a) conditional independencies associated with the random sample assumption and (b) the added assumption of parameter independence. In both parts of the figure, it is assumed that the network structure X1 → X2 is generating the patterns.
Given some prediction of interest that depends on \theta_s and S, say f(\theta_s, S), we can use the posterior distribution of \theta_s to compute an expected prediction:

E(f(\theta_s, S) | D, S) = \int f(\theta_s, S) \, p(\theta_s | D, S) \, d\theta_s .   (9.3)

Associated with our assumption that the data D are a random sample from structure S with uncertain parameters \theta_s is a set of conditional independence assertions. Not surprisingly, some of these assumptions can be represented as a (directed) PIN that includes both the possible observations and the parameters as variables. Figure 11a shows these assumptions for the case where U = \{X_1, X_2\} and S is the structure with a directed edge from X_1 to X_2.

Under certain additional assumptions, described, for example, in Spiegelhalter and Lauritzen (1990), the evaluation of equation 9.2 is straightfor-
ward. In particular, if each pattern u_l is complete (i.e., every variable is observed), we have

p(u_l | \theta_s, S) = \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\delta_{ijkl}} ,   (9.4)
where \delta_{ijkl} is equal to one if X_i = x_i^k and pa(X_i) = pa(X_i)^j in pattern u_l, and zero otherwise. Combining equations 9.2 and 9.4, we obtain

p(\theta_s | D, S) = c \cdot p(\theta_s | S) \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}} ,   (9.5)
where N_{ijk} is the number of patterns in which X_i = x_i^k and pa(X_i) = pa(X_i)^j. The N_{ijk} are the sufficient statistics for the random sample D. If we assume that the parameter vectors \theta_{ij}, i = 1, \ldots, N, j = 1, \ldots, q_i, are mutually independent, an assumption we call parameter independence, then we get the additional simplification

p(\theta_s | D, S) = c \prod_{i=1}^{N} \prod_{j=1}^{q_i} p(\theta_{ij} | S) \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}} .   (9.6)
The assumption of parameter independence for our two-variable example is illustrated in Figure 11b. Thus, given complete data and parameter independence, each parameter vector \theta_{ij} can be updated independently. The update is particularly simple if each parameter vector has a conjugate distribution. For a discrete variable with discrete parents, the natural conjugate distribution is the Dirichlet,

p(\theta_{ij} | S) \propto \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1} ,
in which case equation 9.6 becomes

p(\theta_s | D, S) = c \prod_{i=1}^{N} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk} + \alpha_{ijk} - 1} .   (9.7)
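For a single parameter vector \theta_{ij}, equation 9.7 says that the posterior is again Dirichlet, now with exponents \alpha_{ijk} + N_{ijk}. A minimal sketch of this update, with illustrative counts and prior exponents:

```python
import numpy as np

alpha_ij = np.array([1.0, 1.0, 1.0])    # prior Dirichlet exponents alpha_ijk
N_ij     = np.array([12., 3., 5.])      # sufficient statistics N_ijk from D

posterior_exponents = alpha_ij + N_ij   # Dirichlet(alpha_ijk + N_ijk)
posterior_mean = posterior_exponents / posterior_exponents.sum()
# posterior_mean gives the predictive probabilities p(x_i^k | pa(X_i)^j, D, S)
```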
Other conjugate distributions include the normal-Wishart distribution for the parameters of gaussian codebooks and the Dirichlet distribution for the mixing coefficients of gaussian-mixture codebooks (DeGroot 1970; Buntine 1994; Heckerman and Geiger 1995). Heckerman and Geiger (1995) describe a simple method for assessing these priors. These priors have also been used for learning parameters in standard HMMs (e.g., Gauvain and Lee 1994).
Parameter independence is usually not assumed for HMM structures in general. For example, in the HMM(1,1) model, a standard assumption is that p(H_i | H_{i-1}) = p(H_j | H_{j-1}) and p(O_i | H_i) = p(O_j | H_j) for all appropriate i and j. Fortunately, parameter equalities such as these are easily handled in the framework above (see Thiesson 1995 for a detailed discussion).

In addition, the assumption that patterns are complete is clearly inappropriate for HMM structures in general, where some of the variables are hidden from observation. When data are missing, the exact evaluation of the posterior p(\theta_s | D, S) is typically intractable, so we turn to approximations. Accurate but slow approximations are based on Monte Carlo sampling (e.g., Neal 1993). An approximation that is less accurate but more efficient is based on the observation that, under certain conditions, the quantity p(\theta_s | S) \cdot p(D | \theta_s, S) converges to a multivariate gaussian distribution as the sample size increases (see, e.g., Kass et al. 1988; MacKay 1992a, 1992b). Still less accurate but more efficient approximations are based on the observation that the gaussian distribution converges to a delta function centered at the maximum a posteriori (MAP) and eventually the maximum likelihood (ML) value of \theta_s. For the standard HMM(1,1) model discussed in this article, where either discrete, gaussian, or gaussian-mixture codebooks are used, an ML or MAP estimate is a well-known efficient approximation (Poritz 1988; Rabiner 1989).

MAP and ML estimates can be found using traditional techniques such as gradient descent and expectation-maximization (EM) (Dempster et al. 1977). The EM algorithm can be applied efficiently whenever the likelihood function has sufficient statistics that are of fixed dimension for any data set. The EM algorithm finds a local maximum by initializing the parameters \theta_s (e.g., at random or by some clustering algorithm) and repeating E and M steps to convergence.

In the E step, we compute the expected sufficient statistic for each of the parameters, given D and the current values of \theta_s. In particular, if all variables are discrete, parameter independence is assumed to hold, and all priors are Dirichlet, we obtain

E(N_{ijk} | D, \theta_s, S) = \sum_{l=1}^{M} p(x_i^k, pa(X_i)^j | u_l, \theta_s, S) .
An important feature of the EM algorithm applied to PINs under these assumptions is that each term in the sum can be computed using the JLO algorithm. The JLO algorithm may also be used when some parameters are equal and when the likelihoods of some variables are gaussian or gaussian-mixture distributions (Lauritzen and Wermuth 1989). In the M step, we use the expected sufficient statistics as if they were actual sufficient statistics and set the new values of \theta_s to be the MAP or ML values given these statistics. Again, if all variables are discrete, parameter independence is assumed to
hold, and all priors are Dirichlet, the ML estimate is given by

\theta_{ijk} = \frac{E(N_{ijk} | D, \theta_s, S)}{\sum_{k'=1}^{r_i} E(N_{ijk'} | D, \theta_s, S)} ,

and the MAP estimate is given by

\theta_{ijk} = \frac{E(N_{ijk} | D, \theta_s, S) + \alpha_{ijk} - 1}{\sum_{k'=1}^{r_i} \left( E(N_{ijk'} | D, \theta_s, S) + \alpha_{ijk'} - 1 \right)} .
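The following is a minimal sketch of one EM iteration under these assumptions. The routine expected_counts, which stands in for the JLO-based computation of the terms p(x_i^k, pa(X_i)^j | u_l, \theta_s, S), is assumed rather than implemented, and the Dirichlet exponents are assumed to be at least 1 so that the MAP update is well defined.

```python
import numpy as np

def em_step(theta, data, alpha, expected_counts):
    """theta[i][j]: probability vector over the r_i values of X_i given the
    jth parent configuration; alpha[i][j]: the matching Dirichlet exponents;
    expected_counts(i, j, u, theta): vector of p(x_i^k, pa(X_i)^j | u, ...)."""
    new_theta = {}
    for i in theta:
        new_theta[i] = {}
        for j in theta[i]:
            # E step: E(N_ijk | D, theta_s, S), summed over the patterns u_l.
            EN = sum(expected_counts(i, j, u, theta) for u in data)
            # M step (MAP): normalize E(N_ijk) + alpha_ijk - 1 over k.
            num = EN + alpha[i][j] - 1.0
            new_theta[i][j] = num / num.sum()
    return new_theta
```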
9.2 Model Selection and Averaging for PINs. Now let us assume that we are uncertain not only about the parameters of a PIN but also about the true structure of a PIN. For example, we may know that the true structure is an HMM(K, J) structure, but we may be uncertain about the values of K and J. One solution to this problem is Bayesian model averaging. In this approach, we view each possible PIN structure (without its parameters) as a model. We assign prior probabilities p(S) to different models and compute their posterior probabilities given data:

p(S | D) \propto p(S) \, p(D | S) = p(S) \int p(D | \theta_s, S) \, p(\theta_s | S) \, d\theta_s .   (9.8)
As indicated in equation 9.8, we compute p(D | S) by averaging the likelihood of the data over the parameters of S. In addition to computing the posterior probabilities of models, we estimate the parameters of each model, either by computing the distribution p(\theta_s | D, S) or by using a gaussian, MAP, or ML approximation for this distribution. We then make a prediction of interest based on each model separately, as in equation 9.3, and compute the weighted average of these predictions using the posterior probabilities of models as weights.

One complication with this approach is that when data are missing (for example, when some variables are hidden), the exact computation of the integral in equation 9.8 is usually intractable. As discussed in the previous section, Monte Carlo and gaussian approximations may be used. One simple form of a gaussian approximation is the Bayesian information criterion (BIC) described by Schwarz (1978),

\log p(D | S) \approx \log p(D | \hat{\theta}_s, S) - \frac{d}{2} \log M ,

where \hat{\theta}_s is the ML estimate, M is the number of patterns in D, and d is the dimension of S (typically, the number of parameters of S). The first term of this "score" for S rewards how well the data fit S, whereas the second term punishes model complexity.
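A minimal sketch of this score; the maximized log-likelihood and the dimension d are assumed to be supplied by the caller, and the example numbers are invented.

```python
import numpy as np

def bic_score(loglik_at_ml, d, M):
    """log p(D | S) approximated by log p(D | theta_hat, S) - (d/2) log M."""
    return loglik_at_ml - 0.5 * d * np.log(M)

# Comparing two candidate structures on M = 1000 patterns: the structure
# with the higher BIC score is preferred.
score_simple  = bic_score(-4210.3, d=17, M=1000)
score_complex = bic_score(-4195.8, d=40, M=1000)
```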
Note that this score does not depend on the parameter prior and thus can be applied easily.³ For examples of applications of BIC in the context of PINs and other statistical models, see Raftery (1995). The BIC score is the additive inverse of Rissanen's (1987) minimum description length (MDL). Other scores, which can be viewed as approximations to the marginal likelihood, are hypothesis testing (Raftery 1995) and cross validation (Dawid 1992b). Buntine (in press) provides a comprehensive review of scores for model selection and model averaging in the context of PINs.

³ One caveat: The BIC score is derived under the assumption that the parameter prior is positive throughout its domain.

Another complication with Bayesian model averaging is that there may be so many possible models that averaging becomes intractable. In this case, we select one or a handful of structures with high relative posterior probabilities and make our predictions with this limited set of models. This approach is called model selection. The trick here is finding a model or models with high posterior probabilities. Detailed discussions of search methods for model selection among PINs are given by, among others, Madigan and Raftery (1994), Heckerman et al. (1995), and Spirtes and Meek (1995). When the true model is some HMM(K, J) structure, we may have additional prior knowledge that strongly constrains the possible values of K and J. Here, exhaustive model search is likely to be practical.

10 Summary

Probabilistic independence networks provide a useful framework for both the analysis and application of multivariate probability models when there is considerable structure in the model in the form of conditional independence. The graphical modeling approach both clarifies the independence semantics of the model and yields efficient computational algorithms for probabilistic inference. This article has shown that it is useful to cast HMM structures in a graphical model framework. In particular, the well-known F-B and Viterbi algorithms were shown to be special cases of more general algorithms from the graphical modeling literature. Furthermore, more complex HMM structures, beyond the traditional first-order model, can be analyzed profitably and directly using generally applicable graphical modeling techniques.

Appendix A: The Forward-Backward Algorithm for HMM(1,1) Is a Special Case of the JLO Algorithm

Consider the junction tree for HMM(1,1) as shown in Figure 5b. Let the final clique in the chain containing (H_{N-1}, H_N) be the root clique. Thus, a
nonredundant schedule consists of first recursively passing flows from each (O_i, H_i) and (H_{i-2}, H_{i-1}) to each (H_{i-1}, H_i) in the appropriate sequence (the "collect" phase), and then distributing flows out in the reverse direction from the root clique. If we are interested only in calculating the likelihood of e given the model, then the distribute phase is not necessary, since we can simply marginalize over the local variables in the root clique to obtain p(e). (Subscripts on potential functions and update factors indicate which variables have been used in deriving that potential or update factor; e.g., f_{O_1} indicates that this potential has been updated based on information about O_1 but not using information about any other variables.)

Assume that the junction tree has been initialized so that the potential function for each clique and separator is the local marginal. Given the observed evidence e, each individual piece of evidence O_i = o_i^* is entered into its clique (O_i, H_i) such that each clique marginal becomes f_{O_i}^*(h_i, o_i) = p(h_i, o_i^*) after entering the evidence (as in equation 5.8).

Consider the portion of the junction tree in Figure 12, and in particular the flow between (O_i, H_i) and (H_{i-1}, H_i). By definition the potential on the separator H_i is updated to

f_{O_i}^*(h_i) = \sum_{o_i} f^*(h_i, o_i) = p(h_i, o_i^*) .   (A.1)
The update factor from this separator flowing into clique (H_{i-1}, H_i) is then

\lambda_{O_i}(h_i) = \frac{p(h_i, o_i^*)}{p(h_i)} = p(o_i^* | h_i) .   (A.2)
This update factor is "absorbed" into (H_{i-1}, H_i) as follows:

f_{O_i}^*(h_{i-1}, h_i) = p(h_{i-1}, h_i) \, \lambda_{O_i}(h_i) = p(h_{i-1}, h_i) \, p(o_i^* | h_i) .   (A.3)
Now consider the flow from clique (H_{i-2}, H_{i-1}) to clique (H_{i-1}, H_i). Let \Phi_{i,j} = \{O_i, \ldots, O_j\} denote a set of consecutive observable variables and \phi_{i,j}^* = \{o_i^*, \ldots, o_j^*\} denote a set of observed values for these variables, 1 \le i < j \le N. Assume that the potential on the separator H_{i-1} has been updated to

f_{\Phi_{1,i-1}}^*(h_{i-1}) = p(h_{i-1}, \phi_{1,i-1}^*)   (A.4)
by earlier flows in the schedule. Thus, the update factor on separator H_{i-1} becomes

\lambda_{\Phi_{1,i-1}}(h_{i-1}) = \frac{p(h_{i-1}, \phi_{1,i-1}^*)}{p(h_{i-1})} ,   (A.5)
Figure 12: Local message passing in the HMM(1,1) junction tree during the collect phase of a left-to-right schedule. Ovals indicate cliques, boxes indicate separators, and arrows indicate flows.
and this gets absorbed into clique (H_{i-1}, H_i) to produce

f_{\Phi_{1,i}}^*(h_{i-1}, h_i) = f_{O_i}^*(h_{i-1}, h_i) \, \lambda_{\Phi_{1,i-1}}(h_{i-1})
  = p(h_{i-1}, h_i) \, p(o_i^* | h_i) \, \frac{p(h_{i-1}, \phi_{1,i-1}^*)}{p(h_{i-1})}
  = p(o_i^* | h_i) \, p(h_i | h_{i-1}) \, p(h_{i-1}, \phi_{1,i-1}^*) .   (A.6)
Finally, we can calculate the new potential on the separator for the flow from clique (H_{i-1}, H_i) to (H_i, H_{i+1}):

f_{\Phi_{1,i}}^*(h_i) = \sum_{h_{i-1}} f_{\Phi_{1,i}}^*(h_{i-1}, h_i)   (A.7)
  = p(o_i^* | h_i) \sum_{h_{i-1}} p(h_i | h_{i-1}) \, p(h_{i-1}, \phi_{1,i-1}^*)   (A.8)
  = p(o_i^* | h_i) \sum_{h_{i-1}} p(h_i | h_{i-1}) \, f_{\Phi_{1,i-1}}^*(h_{i-1}) .   (A.9)
Proceeding recursively in this manner, one finally obtains at the root clique

f_{\Phi_{1,N}}^*(h_{N-1}, h_N) = p(h_{N-1}, h_N, \phi_{1,N}^*) ,   (A.10)

from which one can get the likelihood of the evidence,

p(e) = p(\phi_{1,N}^*) = \sum_{h_{N-1}, h_N} f_{\Phi_{1,N}}^*(h_{N-1}, h_N) .   (A.11)
We note that equation A.9 corresponds directly to the recursive equation (equation 20 in Rabiner 1989) for the \alpha variables used in the forward phase of the F-B algorithm, the standard HMM(1,1) inference algorithm. In particular, using a "left-to-right" schedule, the updated potential functions on the separators between the hidden cliques, the f_{\Phi_{1,i}}^*(h_i) functions, are exactly the \alpha variables. Thus, when applied to HMM(1,1), the JLO algorithm produces exactly the same local recursive calculations as the forward phase of the F-B algorithm.

One can also show an equivalence between the backward phase of the F-B algorithm and the JLO inference algorithm. Let the "leftmost" clique in the chain, (H_1, H_2), be the root clique, and define a schedule such that the flows go from right to left. Figure 13 shows a local portion of the clique tree and the associated flows. Consider that the potential on clique (H_i, H_{i+1}) has already been updated by earlier flows from the right. Thus, by definition,

f_{\Phi_{i+1,N}}^*(h_i, h_{i+1}) = p(h_i, h_{i+1}, \phi_{i+1,N}^*) .   (A.12)
The potential on the separator between (H_i, H_{i+1}) and (H_{i-1}, H_i) is calculated as

f_{\Phi_{i+1,N}}^*(h_i) = \sum_{h_{i+1}} p(h_i, h_{i+1}, \phi_{i+1,N}^*)   (A.13)
  = p(h_i) \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, p(\phi_{i+2,N}^* | h_{i+1})   (A.14)

(by virtue of the various conditional independence relations in HMM(1,1))

  = p(h_i) \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \frac{p(\phi_{i+2,N}^*, h_{i+1})}{p(h_{i+1})}   (A.15)
  = p(h_i) \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \frac{f_{\Phi_{i+2,N}}^*(h_{i+1})}{p(h_{i+1})} .   (A.16)
Defining the update factor on this separator yields

\lambda_{\Phi_{i+1,N}}^*(h_i) = \frac{f_{\Phi_{i+1,N}}^*(h_i)}{p(h_i)}   (A.17)
  = \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \frac{f_{\Phi_{i+2,N}}^*(h_{i+1})}{p(h_{i+1})}   (A.18)
  = \sum_{h_{i+1}} p(h_{i+1} | h_i) \, p(o_{i+1}^* | h_{i+1}) \, \lambda_{\Phi_{i+2,N}}^*(h_{i+1}) .   (A.19)
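A companion sketch of the recursion of equation A.19, reusing the illustrative arrays A and B from the forward sketch above; the boundary factor at the right end of the chain is all ones.

```python
import numpy as np

def backward_factors(A, B):
    """Return the update factors lambda_{Phi_{i+1,N}}(h_i) as a list indexed
    from 0, so lams[N-1] is the all-ones boundary term."""
    N = B.shape[0]
    lams = [None] * N
    lams[N - 1] = np.ones(A.shape[0])
    for i in range(N - 2, -1, -1):
        lams[i] = A @ (B[i + 1] * lams[i + 1])   # equation A.19
    return lams

# As a check, p(e) is recoverable as ((pi * B[0]) * backward_factors(A, B)[0]).sum()
```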
Figure 13: Local message passing in the HMM(1,1) junction tree during the collect phase of a right-to-left schedule. Ovals indicate cliques, boxes indicate separators, and arrows indicate flows.
This set of recursive equations in \lambda corresponds exactly to the recursive equation (equation 25 in Rabiner 1989) for the \beta variables in the backward phase of the F-B algorithm. In fact, the update factors \lambda on the separators are exactly the \beta variables. Thus, we have shown that the JLO inference algorithm recreates the F-B algorithm for the special case of the HMM(1,1) probability model.

Appendix B: The Viterbi Algorithm for HMM(1,1) Is a Special Case of Dawid's Algorithm

As with the inference problem, let the final clique in the chain containing (H_{N-1}, H_N) be the root clique and use the same schedule: first a left-to-right collection phase into the root clique, followed by a right-to-left distribution phase out from the root clique. Again it is assumed that the junction tree has been initialized so that the potential functions are the local marginals, and the observable evidence e has been entered into the cliques in the same manner as described for the inference algorithm. We refer again to Figure 12.

The sequence of flow and absorption operations is identical to that of the inference algorithm, with the exception that marginalization operations are replaced by maximization. Thus, the potential on the separator between (O_i, H_i) and (H_{i-1}, H_i) is initially updated to

\hat{f}_{O_i}(h_i) = \max_{o_i} p(h_i, o_i) = p(h_i, o_i^*) .   (B.1)
The update factor for this separator is

\lambda_{O_i}(h_i) = \frac{p(h_i, o_i^*)}{p(h_i)} = p(o_i^* | h_i) ,   (B.2)
and after absorption into the clique (H_{i-1}, H_i) one gets

\hat{f}_{O_i}(h_{i-1}, h_i) = p(h_{i-1}, h_i) \, p(o_i^* | h_i) .   (B.3)
Now consider the flow from clique (H_{i-2}, H_{i-1}) to (H_{i-1}, H_i). Let H_{i,j} = \{H_i, \ldots, H_j\} denote a set of consecutive hidden variables and h_{i,j}^* = \{h_i^*, \ldots, h_j^*\} denote a set of values for these variables, 1 \le i < j \le N. Assume that the potential on separator H_{i-1} has been updated to

\hat{f}_{\Phi_{1,i-1}}(h_{i-1}) = \max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*)   (B.4)
by earlier flows in the schedule. Thus, the update factor for separator H_{i-1} becomes

\lambda_{\Phi_{1,i-1}}(h_{i-1}) = \frac{\max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*)}{p(h_{i-1})} ,   (B.5)
and this gets absorbed into clique (H_{i-1}, H_i) to produce

\hat{f}_{\Phi_{1,i}}(h_{i-1}, h_i) = \hat{f}_{O_i}(h_{i-1}, h_i) \, \lambda_{\Phi_{1,i-1}}(h_{i-1})   (B.6)
  = p(h_{i-1}, h_i) \, p(o_i^* | h_i) \, \frac{\max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*)}{p(h_{i-1})} .   (B.7)

We can now obtain the new potential on the separator for the flow from clique (H_{i-1}, H_i) to (H_i, H_{i+1}):

\hat{f}_{\Phi_{1,i}}(h_i) = \max_{h_{i-1}} \hat{f}_{\Phi_{1,i}}(h_{i-1}, h_i)   (B.8)
  = p(o_i^* | h_i) \max_{h_{i-1}} \{ p(h_i | h_{i-1}) \max_{h_{1,i-2}} p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*) \}   (B.9)
  = p(o_i^* | h_i) \max_{h_{1,i-1}} \{ p(h_i | h_{i-1}) \, p(h_{i-1}, h_{1,i-2}, \phi_{1,i-1}^*) \}   (B.10)
  = \max_{h_{1,i-1}} p(h_i, h_{1,i-1}, \phi_{1,i}^*) ,   (B.11)
which is the result one expects for the updated potential at this clique. Thus, we can express the separator potential \hat{f}_{\Phi_{1,i}}(h_i) recursively (via equation B.10) as

\hat{f}_{\Phi_{1,i}}(h_i) = p(o_i^* | h_i) \max_{h_{i-1}} \{ p(h_i | h_{i-1}) \, \hat{f}_{\Phi_{1,i-1}}(h_{i-1}) \} .   (B.12)
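Combining the recursion of equation B.12 with the backtracking pointers discussed below yields the familiar Viterbi procedure; a minimal sketch, reusing the illustrative arrays pi, A, and B from the Appendix A sketches:

```python
import numpy as np

def viterbi(pi, A, B):
    N, m = B.shape
    delta = pi * B[0]                    # potential on the first separator
    back = np.zeros((N, m), dtype=int)   # pointers back to the previous clique
    for i in range(1, N):
        scores = delta[:, None] * A      # scores[h_{i-1}, h_i]
        back[i] = scores.argmax(axis=0)
        delta = B[i] * scores.max(axis=0)   # equation B.12
    # Backtrack from the maximizing final state (equations B.14 and B.15).
    h = [int(delta.argmax())]
    for i in range(N - 1, 0, -1):
        h.append(int(back[i][h[-1]]))
    return delta.max(), h[::-1]          # f-hat(e) and the MAP state sequence
```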
This is the same recursive equation as that used for the \delta variables in the Viterbi algorithm (equation 33a in Rabiner 1989): the separator potentials in Dawid's algorithm using a left-to-right schedule are exactly the \delta's used in the Viterbi method for solving the MAP problem in HMM(1,1). Proceeding recursively in this manner, one finally obtains at the root clique

\hat{f}_{\Phi_{1,N}}(h_{N-1}, h_N) = \max_{h_{1,N-2}} p(h_{N-1}, h_N, h_{1,N-2}, \phi_{1,N}^*) ,   (B.13)
from which one can get the likelihood of the evidence given the most likely state of the hidden variables:

\hat{f}(e) = \max_{h_{N-1}, h_N} \hat{f}_{\Phi_{1,N}}(h_{N-1}, h_N)   (B.14)
  = \max_{h_{1,N}} p(h_{1,N}, \phi_{1,N}^*) .   (B.15)
Identification of the values of the hidden variables that maximize the evidence likelihood can be carried out in the standard manner as in the Viterbi method, namely, by keeping a pointer at each clique along the flow in the forward direction back to the previous clique and then backtracking along this list of pointers from the root clique after the collection phase is complete. An alternative approach is to use the distribute phase of Dawid's algorithm. This has the same effect: once the distribution flows are completed, each local clique can calculate both the maximum value of the evidence likelihood given the hidden variables and the values of the hidden variables achieving this maximum that are local to that particular clique.

Acknowledgments

MIJ gratefully acknowledges discussions with Steffen Lauritzen on the application of the IPF algorithm to UPINs. The research described in this article was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

References

Baum, L. E., and Petrie, T. 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat. 37, 1554–1563.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. 1973. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.
Buntine, W. 1994. Operations for learning with graphical models. Journal of Artificial Intelligence Research 2, 159–225.
Buntine, W. In press. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering.
Dawid, A. P. 1992a. Applications of a general propagation algorithm for probabilistic expert systems. Statistics and Computing 2, 25–36.
Dawid, A. P. 1992b. Prequential analysis, stochastic complexity, and Bayesian inference (with discussion). In Bayesian Statistics 4, J. M. Bernardo, J. Berger, A. P. Dawid, and A. F. M. Smith, eds., pp. 109–125. Oxford University Press, London.
DeGroot, M. 1970. Optimal Statistical Decisions. McGraw-Hill, New York.
Dempster, A., Laird, N., and Rubin, D. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38.
Elliott, R. J., Aggoun, L., and Moore, J. B. 1995. Hidden Markov Models: Estimation and Control. Springer-Verlag, New York.
Frasconi, P., and Bengio, Y. 1994. An EM approach to grammatical inference: Input/output HMMs. In Proceedings of the 12th IAPR Intl. Conf. on Pattern Recognition, pp. 289–294. IEEE Computer Society Press, Los Alamitos, CA.
Gauvain, J., and Lee, C. 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Sig. Audio Proc. 2, 291–298.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. 6, 721–741.
Ghahramani, Z., and Jordan, M. I. 1996. Factorial hidden Markov models. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds., pp. 472–478. MIT Press, Cambridge, MA.
Heckerman, D., and Geiger, D. 1995. Likelihoods and Priors for Bayesian Networks. MSR-TR-95-54. Microsoft Corporation, Redmond, WA.
Heckerman, D., Geiger, D., and Chickering, D. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20, 197–243.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., vol. 1, chap. 7. MIT Press, Cambridge, MA.
Huang, X. D., Ariki, Y., and Jack, M. A. 1990. Hidden Markov Models for Speech Recognition. Edinburgh University Press, Edinburgh.
Isham, V. 1981. An introduction to spatial point processes and Markov random fields. International Statistical Review 49, 21–43.
Itzykson, C., and Drouffé, J.-M. 1991. Statistical Field Theory. Cambridge University Press, Cambridge.
Jensen, F. V., Lauritzen, S. L., and Olesen, K. G. 1990. Bayesian updating in recursive graphical models by local computations. Computational Statistical Quarterly 4, 269–282.
Jiřousek, R., and Přeučil, S. 1995. On the effective implementation of the iterative proportional fitting procedure. Computational Statistics and Data Analysis 19, 177–189.
Kass, R., Tierney, L., and Kadane, J. 1988. Asymptotics in Bayesian computation. In Bayesian Statistics 3, J. Bernardo, M. DeGroot, D. Lindley, and A. Smith, eds., pp. 261–278. Oxford University Press, Oxford.
Kent, R. D., and Minifie, F. D. 1977. Coarticulation in recent speech production models. Journal of Phonetics 5, 115–117.
Lauritzen, S. L., and Spiegelhalter, D. J. 1988. Local computations with probabilities on graphical structures and their application to expert systems (with discussion). J. Roy. Statist. Soc. Ser. B 50, 157–224.
Lauritzen, S. L., Dawid, A. P., Larsen, B. N., and Leimer, H. G. 1990. Independence properties of directed Markov fields. Networks 20, 491–505.
Lauritzen, S., and Wermuth, N. 1989. Graphical models for associations between variables, some of which are qualitative and some quantitative. Annals of Statistics 17, 31–57.
Lindblom, B. 1990. Explaining phonetic variation: A sketch of the H&H theory. In Speech Production and Speech Modeling, W. J. Hardcastle and A. Marchal, eds., pp. 403–440. Kluwer, Dordrecht.
Lucke, H. 1995. Bayesian belief networks as a tool for stochastic parsing. Speech Communication 16, 89–118.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Computation 4, 415–447.
MacKay, D. J. C. 1992b. A practical Bayesian framework for backpropagation networks. Neural Computation 4, 448–472.
Madigan, D., and Raftery, A. E. 1994. Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Am. Stat. Assoc. 89, 1535–1546.
Modestino, J., and Zhang, J. 1992. A Markov random field model-based approach to image segmentation. IEEE Trans. Patt. Anal. Mach. Int. 14(6), 606–615.
Morgenstern, I., and Binder, K. 1983. Magnetic correlations in two-dimensional spin-glasses. Physical Review B 28, 5216.
Neal, R. 1993. Probabilistic inference using Markov chain Monte Carlo methods. CRG-TR-93-1. Department of Computer Science, University of Toronto.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pearl, J., Geiger, D., and Verma, T. 1990. The logic of influence diagrams. In Influence Diagrams, Belief Nets, and Decision Analysis, R. M. Oliver and J. Q. Smith, eds., pp. 67–83. John Wiley, Chichester, UK.
Perkell, J. S., Matthies, M. L., Svirsky, M. A., and Jordan, M. I. 1993. Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: A pilot motor equivalence study. Journal of the Acoustical Society of America 93, 2948–2961.
Poritz, A. M. 1988. Hidden Markov models: A guided tour. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1:7–13. IEEE Press, New York.
Rabiner, L. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 257–285.
Raftery, A. 1995. Bayesian model selection in social research (with discussion). In Sociological Methodology, P. Marsden, ed., pp. 111–196. Blackwell, Cambridge, MA.
Rissanen, J. 1987. Stochastic complexity (with discussion). Journal of the Royal Statistical Society, Series B 49, 223–239, 253–265.
Saul, L. K., and Jordan, M. I. 1995. Boltzmann chains and hidden Markov models. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds., pp. 435–442. MIT Press, Cambridge, MA.
Saul, L. K., and Jordan, M. I. 1996. Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds., pp. 486–492. MIT Press, Cambridge, MA.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Shachter, R. D., Anderson, S. K., and Szolovits, P. 1994. Global conditioning for probabilistic inference in belief networks. In Proceedings of the Uncertainty in AI Conference 1994, pp. 514–522. Morgan Kaufmann, San Mateo, CA.
Spiegelhalter, D. J., Dawid, A. P., Hutchinson, T. A., and Cowell, R. G. 1991. Probabilistic expert systems and graphical modelling: A case study in drug safety. Phil. Trans. R. Soc. Lond. A 337, 387–405.
Spiegelhalter, D. J., and Lauritzen, S. L. 1990. Sequential updating of conditional probabilities on directed graphical structures. Networks 20, 579–605.
Spirtes, P., and Meek, C. 1995. Learning Bayesian networks with discrete variables from data. In Proceedings of First International Conference on Knowledge Discovery and Data Mining, pp. 294–299. AAAI Press, Menlo Park, CA.
Stolorz, P. 1994. Recursive approaches to the statistical physics of lattice proteins. In Proc. 27th Hawaii Intl. Conf. on System Sciences, L. Hunter, ed., 5:316–325.
Swendsen, R. H., and Wang, J.-S. 1987. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters 58.
Thiesson, B. 1995. Score and information for recursive exponential models with incomplete data. Tech. rep. Institute of Electronic Systems, Aalborg University, Aalborg, Denmark.
Vandermeulen, D., Verbeeck, R., Berben, L., Delaere, D., Suetens, P., and Marchal, G. 1994. Continuous voxel classification by stochastic relaxation: Theory and application to MR imaging and MR angiography. Image and Vision Computing 12(9), 559–572.
Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. John Wiley, Chichester, UK.
Williams, C., and Hinton, G. E. 1990. Mean field networks that learn to discriminate temporally distorted strings. In Proc. Connectionist Models Summer School, pp. 18–22. Morgan Kaufmann, San Mateo, CA.
Received February 2, 1996; accepted April 22, 1996.
NOTE
Communicated by Andrew Barto and Michael Jordan
Using Expectation-Maximization for Reinforcement Learning

Peter Dayan
Department of Brain and Cognitive Sciences, Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
Geoffrey E. Hinton Department of Computer Science, University of Toronto, Toronto M5S 1A4, Canada
We discuss Hinton's (1989) relative payoff procedure (RPP), a static reinforcement learning algorithm whose foundation is not stochastic gradient ascent. We show circumstances under which applying the RPP is guaranteed to increase the mean return, even though it can make large changes in the values of the parameters. The proof is based on a mapping between the RPP and a form of the expectation-maximization procedure of Dempster, Laird, and Rubin (1977).

1 Introduction

Consider a stochastic learning automaton (e.g., Narendra & Thatachar 1989) whose actions y are drawn from a set Y. This could, for instance, be the set of 2^n choices over n separate binary decisions. If the automaton maintains a probability distribution p(y | \theta) over these possible actions, where \theta is a set of parameters, then its task is to learn values of the parameters \theta that maximize the expected payoff:

\rho(\theta) = \langle E[r | y] \rangle_{p(y | \theta)} = \sum_{y \in Y} p(y | \theta) \, E[r | y] .   (1.1)
Here, E[r | y] is the expected reward for performing action y. Apart from random search, simulated annealing, and genetic algorithms, almost all the methods with which we are familiar for attempting to choose appropriate \theta in such domains are local in the sense that they make small steps, usually in a direction that bears some relation to the gradient. Examples include the Kiefer-Wolfowitz procedure (Wasan 1969), the A_{R-P} algorithm (Barto & Anandan 1985), and the REINFORCE framework (Williams 1992).

There are two reasons to make small steps. One is that there is a noisy estimation problem. Typically the automaton will emit single actions y_1, y_2, \ldots according to p(y | \theta), will receive single samples of the reward from the distributions p(r | y_m), and will have to average the gradient over these noisy values. The other reason is that \theta might have a complicated effect on which actions are chosen, or the relationship between actions and rewards
might be obscure. For instance, if Y is the set of 2^n choices over n separate binary actions a_1, \ldots, a_n, and \theta = \{p_1, \ldots, p_n\} is the collection of probabilities of choosing a_i = 1, so

p(y = \{a_1, \ldots, a_n\} | \theta) = \prod_{i=1}^{n} p_i^{a_i} (1 - p_i)^{1 - a_i} ,   (1.2)
then the average reward \rho(\theta) depends in a complicated manner on the collection of p_i. Gradient ascent in \theta would seem the only option, because taking large steps might lead to decreases in the average reward.

The sampling problem is indeed present, and we will evade it by assuming a large batch size. However, we show that in circumstances such as the n binary action task, it is possible to make large, well-founded changes to the parameters without explicitly estimating the curvature of the space of expected payoffs, by a mapping onto a maximum likelihood probability density estimation problem. In effect, we maximize reward by solving a sequence of probability matching problems, where \theta is chosen at each step to match as best it can a fictitious distribution determined by the average rewards experienced on the previous step. Although there can be large changes in \theta from one step to the next, we are guaranteed that the average reward is monotonically increasing. The guarantee comes for exactly the same reason as in the expectation-maximization (EM) algorithm (Baum et al. 1970; Dempster et al. 1977) and, as with EM, there can be local optima. The relative payoff procedure (RPP) (Hinton 1989) is a particular reinforcement learning algorithm for the n binary action task with positive r, which makes large moves in the p_i. Our proof demonstrates that the RPP is well founded.

2 Theory

The RPP operates to improve the parameters p_1, \ldots, p_n of equation 1.2 in a synchronous manner based on substantial sampling. It suggests updating the probability of choosing a_i = 1 to

p_i' = \frac{\langle a_i E[r | y] \rangle_{p(y | \theta)}}{\langle E[r | y] \rangle_{p(y | \theta)}} ,   (2.1)
which is the ratio of the mean reward that accrues when a_i = 1 to the net mean reward. If all the rewards r are positive, then 0 \le p_i' \le 1. This note proves that when using the RPP, the expected reinforcement increases; that is,

\langle E[r | y] \rangle_{p(y | \theta')} \ge \langle E[r | y] \rangle_{p(y | \theta)} , \quad where \; \theta' = \{p_1', \ldots, p_n'\} .
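A minimal sketch of one RPP update with exact expectations, which is feasible when n is small enough that Y can be enumerated; the reward function below is an illustrative stand-in for E[r | y] and is assumed positive.

```python
import itertools
import numpy as np

def rpp_update(p, expected_reward):
    """p[i] is the probability of choosing a_i = 1; expected_reward(y) > 0."""
    n = len(p)
    num, rho = np.zeros(n), 0.0
    for bits in itertools.product([0, 1], repeat=n):
        y = np.array(bits)
        prob = np.prod(np.where(y == 1, p, 1 - p))   # equation 1.2
        r = prob * expected_reward(y)
        rho += r                       # <E[r | y]>_{p(y | theta)}
        num += y * r                   # <a_i E[r | y]>_{p(y | theta)}
    return num / rho                   # equation 2.1

# Each call provably does not decrease the expected payoff rho(theta).
p = np.array([0.3, 0.7])
p = rpp_update(p, lambda y: 1.0 + y.sum())   # toy positive rewards
```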
The proof rests on the following observation: Given a current value of \theta, if one could arrange that

\alpha E[r | y] = \frac{p(y | \theta')}{p(y | \theta)}   (2.2)
for some \alpha, then \theta' would lead to higher average returns than \theta. We prove this formally below, but the intuition is that if E[r | y_1] > E[r | y_2], then

\frac{p(y_1 | \theta')}{p(y_2 | \theta')} > \frac{p(y_1 | \theta)}{p(y_2 | \theta)} ,
so \theta' will put more weight on y_1 than \theta does. We therefore pick \theta' so that p(y | \theta') matches the distribution \alpha E[r | y] p(y | \theta) as closely as possible (using a Kullback-Leibler penalty). Note that this target distribution moves with \theta. Matching just \beta E[r | y], something that animals can be observed to do under some circumstances (Gallistel 1990), does not result in maximizing average rewards (Sabes & Jordan 1995). If the rewards r are stochastic, then our method (and the RPP) does not eliminate the need for repeated sampling to work out the mean return. We assume knowledge of E[r | y]. Defining the distribution in equation 2.2 correctly requires E[r | y] > 0. Since maximizing \rho(\theta) and \rho(\theta) + \omega has the same consequences for \theta, we can add arbitrary constants to the rewards so that they are all positive. However, this can affect the rate of convergence.

We now show how an improvement in the expected reinforcement can be guaranteed:

\log \frac{\rho(\theta')}{\rho(\theta)} = \log \sum_{y \in Y} p(y | \theta') \frac{E[r | y]}{\rho(\theta)}
  = \log \sum_{y \in Y} \left[ \frac{p(y | \theta) E[r | y]}{\rho(\theta)} \right] \frac{p(y | \theta')}{p(y | \theta)}   (2.3)
  \ge \sum_{y \in Y} \left[ \frac{p(y | \theta) E[r | y]}{\rho(\theta)} \right] \log \frac{p(y | \theta')}{p(y | \theta)} , \quad by Jensen's inequality
  = \frac{1}{\rho(\theta)} \left[ Q(\theta, \theta') - Q(\theta, \theta) \right] ,   (2.4)

where

Q(\theta, \theta') = \sum_{y \in Y} p(y | \theta) \, E[r | y] \log p(y | \theta') ,
so if Q(\theta, \theta') \ge Q(\theta, \theta), then \rho(\theta') \ge \rho(\theta). The normalization step in equation 2.3 creates the matching distribution from equation 2.2. Given \theta, if \theta' is chosen to maximize Q(\theta, \theta'), then we are guaranteed that Q(\theta, \theta') \ge Q(\theta, \theta) and therefore that the average reward is nondecreasing.
In the RPP, the new probability p_i' for choosing a_i = 1 is given by

p_i' = \frac{\langle a_i E[r | y] \rangle_{p(y | \theta)}}{\langle E[r | y] \rangle_{p(y | \theta)}} ,   (2.5)
so it is the fraction of the average reward that arrives when a_i = 1. Using equation 1.2,

\frac{\partial Q(\theta, \theta')}{\partial p_i'} = \frac{1}{p_i'(1 - p_i')} \left[ \sum_{y \in Y : a_i = 1} p(y | \theta) E[r | y] - p_i' \sum_{y \in Y} p(y | \theta) E[r | y] \right] .

So, if

p_i' = \frac{\sum_{y \in Y : a_i = 1} p(y | \theta) E[r | y]}{\sum_{y \in Y} p(y | \theta) E[r | y]} ,

then

\frac{\partial Q(\theta, \theta')}{\partial p_i'} = 0 ,
and it is readily seen that Q(\theta, \theta') is maximized. But this condition is just that of equation 2.5. Therefore the RPP is monotonic in the average return.

Figure 1 shows the consequence of employing the RPP. Figure 1a shows the case in which n = 2; the two lines and associated points show how p_1 and p_2 change on successive steps using the RPP. The terminal value p_1 = p_2 = 0, reached by the left-hand line, is a local optimum. Note that the RPP always changes the parameters in the direction of the gradient of the expected amount of reinforcement (this is generally true) but by a variable amount. Figure 1b compares the RPP with (a deterministic version of) Williams's (1992) stochastic gradient ascent REINFORCE algorithm for a case with n = 12 and rewards E[r | y] drawn from an exponential distribution. The RPP and REINFORCE were started at the same point; the graph shows the difference between the maximum possible reward and the expected reward after given numbers of iterations. To make a fair comparison between the two algorithms, we chose n small enough that the exact averages in equation 2.1 and (for REINFORCE) the exact gradients,

\Delta p_i = \alpha \frac{\partial}{\partial p_i} \langle E[r | y] \rangle_{p(y | \theta)} ,

could be calculated. Figure 1b shows the (consequently smooth) course of learning for various values of the learning rate. We observe that for both algorithms the expected return never decreases (as guaranteed for the RPP but not for REINFORCE), that the course of learning is not completely smooth, with a large plateau in the middle, and that both algorithms get stuck in
local optima. This is a best case for the RPP: only for a small learning rate, and consequently slow learning, does REINFORCE not get stuck in a worse local optimum. In other cases, there are values of \alpha for which REINFORCE beats the RPP. However, there are no free parameters in the RPP, and it performs well across a variety of such problems.

3 Discussion

The analogy to EM can be made quite precise. EM is a maximum likelihood method for probability density estimation for a collection X of observed data where underlying each point x \in X there can be a hidden variable y \in Y. The density has the form

p(x | \theta) = \sum_{y \in Y} p(y | \theta) \, p(x | y, \theta) ,

and we seek to choose \theta to maximize

\sum_{x \in X} \log[p(x | \theta)] .

The E phase of EM calculates the posterior responsibilities p(y | x, \theta) for each y \in Y for each x:

p(y | x, \theta) = \frac{p(y | \theta) \, p(x | y, \theta)}{\sum_{z \in Y} p(z | \theta) \, p(x | z, \theta)} .

In our case, there is no x, but the equivalent of this posterior distribution, which comes from equation 2.2, is

P_y(\theta) \equiv \frac{p(y | \theta) \, E[r | y]}{\sum_{z \in Y} p(z | \theta) \, E[r | z]} .

The M phase of EM chooses parameters \theta' in the light of this posterior distribution to maximize

\sum_{x \in X} \sum_{y \in Y} p(y | x, \theta) \log[p(x, y | \theta')] .

In our case this is exactly equivalent to minimizing the Kullback-Leibler divergence

KL[P_y(\theta), p(y | \theta')] = - \sum_{y \in Y} P_y(\theta) \log \left[ \frac{p(y | \theta')}{P_y(\theta)} \right] .

Up to some terms that do not affect \theta', this is -Q(\theta, \theta'). The Kullback-Leibler divergence between two distributions is a measure of the distance between them, and therefore minimizing it is a form of probability matching.
Figure 1: Performance of the RPP. (a) Adaptation of p1 and p2 using the RPP from two different start points on the given ρ(p1 , p2 ). The points are successive values; the lines are joined for graphical convenience. (b) Comparison of the RPP with Williams’s (1992) REINFORCE for a particular problem with n = 12. See text for comment and details.
Our result is weak: the requirement for sampling from the distribution is rather restrictive, and we have not proved anything about the actual rate of convergence. The algorithm performs best (Sutton, personal communication) if the differences between the rewards are of the same order of magnitude as the rewards themselves (as a result of the normalization in equation 2.3). It uses multiplicative comparison rather than the subtractive comparison of Sutton (1984), Williams (1992), and Dayan (1990).

The link to the EM algorithm suggests that there may be reinforcement learning algorithms other than the RPP that make large changes to the values of the parameters for which similar guarantees about nondecreasing average rewards can be given. The most interesting extension would be to dynamic programming (Bellman 1957), where techniques for choosing (single-component) actions to optimize return in sequential decision tasks include two algorithms that make large changes on each step: value and policy iteration (Howard 1960). As various people have noted, the latter explicitly involves both estimation (of the value of a policy) and maximization (choosing a new policy in the light of the value of each state under the old one), although its theory is not at all described in terms of density modeling.

Acknowledgments

Support came from the Natural Sciences and Engineering Research Council and the Canadian Institute for Advanced Research (CIAR). GEH is the Nesbitt-Burns Fellow of the CIAR. We are grateful to Philip Sabes and Mike Jordan for the spur and to Mike Jordan and the referees for comments on an earlier version of this paper.

References

Barto, A. G., & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems, Man and Cybernetics, 15, 360–374.
Baum, L. E., Petrie, E., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat., 41, 164–171.
Bellman, R. E. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Dayan, P. (1990). Reinforcement comparison. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, & G. E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 45–51). San Mateo, CA: Morgan Kaufmann.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.
Narendra, K. S., & Thathachar, M. A. L. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice-Hall.
Sabes, P. N., & Jordan, M. I. (1995). Reinforcement learning by probability matching. In Advances in Neural Information Processing Systems, 8. Cambridge, MA: MIT Press.
Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Unpublished doctoral dissertation, University of Massachusetts, Amherst.
Wasan, M. T. (1969). Stochastic approximation. Cambridge: Cambridge University Press.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Received October 2, 1995; accepted May 30, 1996.
Fast Sigmoidal Networks via Spiking Neurons
Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, Graz, Austria
We show that networks of relatively realistic mathematical models for biological neurons can in principle simulate arbitrary feedforward sigmoidal neural nets in a way that has previously not been considered. This new approach is based on temporal coding by single spikes (respectively, by the timing of synchronous firing in pools of neurons) rather than on the traditional interpretation of analog variables in terms of firing rates. The resulting new simulation is substantially faster and hence more consistent with experimental results about the maximal speed of information processing in cortical neural systems. As a consequence we can show that networks of noisy spiking neurons are "universal approximators" in the sense that they can approximate with regard to temporal coding any given continuous function of several variables. This result holds for a fairly large class of schemes for coding analog variables by firing times of spiking neurons. This new proposal for the possible organization of computations in networks of spiking neurons has some interesting consequences for the type of learning rules that would be needed to explain the self-organization of such networks. Finally, the fast and noise-robust implementation of sigmoidal neural nets by temporal coding points to possible new ways of implementing feedforward and recurrent sigmoidal neural nets with pulse stream VLSI.
1 Introduction

Sigmoidal neural nets are the most powerful and flexible computational model known today. In addition they have the advantage of allowing "self-organization" via a variety of quite successful learning algorithms. Unfortunately the computational units of sigmoidal neural nets differ strongly from biological neurons, and it is particularly dubious whether sigmoidal neural nets provide a useful paradigm for the organization of fast computations in cortical neural systems. Traditionally one views the firing rate of a neuron as the representation of an analog variable in analog computations with spiking neurons, in particular in the simulation of sigmoidal neural nets by spiking neurons.

Neural Computation 9, 279–304 (1997)
© 1997 Massachusetts Institute of Technology
However, with regard to fast cortical computations, this view is inconsistent with experimental data. Perrett et al. (1982) and Thorpe and Imbert (1989) have demonstrated that visual pattern analysis and pattern classification can be carried out by humans in just 100 ms, in spite of the fact that it involves a minimum of 10 synaptic stages from the retina to the temporal lobe. The same speed of visual processing has been measured by Rolls and others in macaque monkeys. Furthermore they have shown that a single cortical area involved in visual processing can complete its computation in just 20 to 30 ms (Rolls 1994; Rolls and Tovee 1994). On the other hand, the firing rates of neurons involved in these computations are usually below 100 Hz, and hence at least 20 to 30 ms would be needed just to sample the current firing rate of a neuron. Thus a coding of analog variables by firing rates is quite dubious in the context of fast cortical computations. Experimental evidence accumulated during the past few years indicates that many biological neural systems use the timing of single action potentials (or "spikes") to encode information (Abeles et al. 1993; Bialek and Rieke 1992; Bair et al. 1994; …
function (RBF) units, where the weights of the RBF units are stored in the delays between synapses and soma of a spiking neuron. Hence Hopfield's construction provides an efficient way of implementing a lookup table with spiking neurons (with some very nice invariance regarding the strength of the stimulus). However, in contrast to the construction considered here, his system is based on "grandmother neurons," and it is not geared toward providing an informative output in a situation where the input (s_1, ..., s_n) does not match (up to a factor) one of the fixed set of stored patterns (because, for example, it is a superposition of several stored patterns). Furthermore Hopfield's construction provides no method for simulating multilayer neural nets. In addition, in contrast to our construction, it assigns no computational or learning-related role to the efficacy (i.e., strength) of synapses between biological neurons.

We describe in the remainder of this section the precise models for sigmoidal neural nets and noisy spiking neurons that we consider, and at the end of this section we describe the key mechanism of our simulation. The main construction of this article is given in Section 2, and our main result is stated in the theorem at the end of that section. In Section 3 we show that this result implies that networks of noisy spiking neurons are universal approximators. We also prove that this result holds for a fairly large class of schemes for temporal coding of analog variables. In Section 4 we briefly indicate some new perspectives about the organization of learning in biological neural systems that follow from this approach.

We point out that this is not an article about biology but about computational complexity theory. Its main results (given in Sections 2 and 3) are rigorous theoretical results about the computational power of common mathematical models for networks of spiking neurons. However, some informal comments have been added (after the theorem in Section 2, as well as in Sections 4 and 5) in order to facilitate a discussion of the biological relevance of this mathematical model and its theoretical consequences.

The computational unit of a sigmoidal neural net is a sigmoidal gate (σ-gate) G that assigns to analog input numbers x_1, ..., x_{n−1} ∈ [0, γ] an output of the form σ(Σ_{i=1}^{n−1} r_i · x_i + r_n). The function σ: R → [0, γ] is called the activation function of G; r_1, ..., r_{n−1} are the weights of G, and r_n is the bias of G. These are considered adjustable parameters of G in the context of a learning process. The parameter γ > 0 determines the scale of the analog computations carried out by the neural net. For convenience we assume that each σ-gate G has an additional input x_n with some constant value c ∈ (0, γ] available. Hence after rescaling r_n, the function f_G that is computed by G can be viewed as a restriction of the function
to arguments with x_n = c. The original choice for the activation function σ in Rumelhart et al. (1986) was the logistic sigmoid function σ(y) = 1/(1 + e^{−y}). Many years of practical experience with sigmoidal neural nets have shown that the exact form of the activation function σ is not relevant for the computational power and learning capabilities of such neural nets, as long as σ is nondecreasing and almost everywhere differentiable, the limits lim_{y→−∞} σ(y) and lim_{y→∞} σ(y) have finite values, and σ increases approximately linearly in some intermediate range. Gradient descent learning procedures such as backpropagation formally require that σ is differentiable everywhere, but practically one can just as well use the piecewise linear "linear-saturated" activation function π_γ: R → [0, γ] defined by

π_γ(y) = 0 if y < 0,   π_γ(y) = y if 0 ≤ y ≤ γ,   π_γ(y) = γ if y > γ.
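A one-line sketch of π_γ, and of a gate built from it, in Python (the helper names are our own; all code examples in this rewrite are illustrative sketches rather than the author's implementation):

def pi_gamma(y, gamma):
    # Linear-saturated activation: 0 below 0, the identity on [0, gamma],
    # and gamma above gamma.
    return min(max(y, 0.0), gamma)

def gate(r, s, gamma):
    # A pi_gamma-gate with weights r applied to inputs s (bias folded in).
    return pi_gamma(sum(ri * si for ri, si in zip(r, s)), gamma)

print(gate([0.5, -0.25], [0.8, 0.4], gamma=1.0))   # -> 0.3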
As a model for a spiking neuron we take the common model of a leaky integrate-and-fire neuron with noise, in the formulation of the somewhat more general spike response model of Gerstner and van Hemmen (1994). The only specific assumption needed for the construction in this article is that postsynaptic potentials can be described (or at least approximated) by a linear function during some initial segment. Actually the constructions of this article appear to be of interest even if this assumption is not satisfied, but in that case they are harder to analyze theoretically.

We consider networks that consist of a finite set V of spiking neurons, a set E ⊆ V × V of synapses, a weight w_{u,v} ≥ 0 and a response function ε_{u,v}: R⁺ → R for each synapse (u, v) ∈ E (where R⁺ := {x ∈ R: x ≥ 0}), and a threshold function Θ_v: R⁺ → R⁺ for each neuron v ∈ V. Each response function ε_{u,v} models either an EPSP or an IPSP; the typical shape of EPSPs and IPSPs is indicated in Figure 1. If F_u ⊆ R⁺ is the set of firing times of a neuron u, then the potential at the trigger zone of neuron v at time t is given by

P_v(t) := Σ_{u: (u,v)∈E} Σ_{t′∈F_u: t′<t} w_{u,v} · ε_{u,v}(t − t′).

Furthermore one considers a threshold function Θ_v(t − t′) that quantifies the "reluctance" of v to fire again at time t if its last previous firing was at time t′. Thus Θ_v(x) is extremely large for small x and then approaches Θ_v(0) for larger x. In a noise-free model, a neuron v fires at time t as soon as P_v(t) reaches Θ_v(t − t′). The precise form of this threshold function Θ_v is not important for the constructions in this article, since we consider here only computations that rely on the timing of the first spike in a spike train. Thus it suffices to assume that Θ_v(t − t′) = Θ_v(0) for sufficiently large values of t − t′ and
Figure 1: The typical shape of inhibitory and excitatory postsynaptic potentials at a biological neuron. (We assume that the resting membrane potential has the value 0.)
that inf{Θ_v(x): x ∈ (0, γ]} is larger than the potentials P_v(t) that occur in the construction of Section 2 for t ∈ [T_out − γ, T_out]. The latter condition (which amounts to the assumption of a sufficiently long refractory period) will prevent iterated firing of neuron v during the critical time interval [T_out − γ, T_out].

The construction in Section 2 is robust with respect to several types of noise that make the model of a spiking neuron biologically more realistic. As in the model for a leaky integrate-and-fire neuron with noise, we allow that the potentials P_v(t) and the threshold functions Θ_v(t − t′) are subject to some additive noise. Hence P_v(t) is replaced by

P_v^{noisy}(t) := P_v(t) + α_v(t),

and Θ_v(t − t′) is replaced by

Θ_v^{noisy}(t − t′) := Θ_v(t − t′) + β_v(t),

where α_v(t) and β_v(t) describe the impact of some unknown (or even adversarial) source of noise (which might, for example, result from synapse failures). One assumes in most previous theoretical studies that α_v(t), β_v(t) are distributed according to some specific probability distribution (e.g., white noise), whereas our subsequent constructions allow that α_v(t), β_v(t) are some arbitrary functions with bounded absolute value (e.g., "systematic noise").

In a simpler model for a noisy spiking neuron, one assumes that a neuron v fires exactly at those time points t when P_v^{noisy}(t) reaches from below the value Θ_v^{noisy}(t − t′). We consider in this article a biologically more realistic
model, whereas in Gerstner and van Hemmen (1994), the size of the difference P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) governs just the probability that neuron v fires. The choice of the exact firing times is left up to some unknown stochastic processes, and it may, for example, occur that v does not fire in a time interval I during which P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) > 0, or that v fires "spontaneously" at a time t when P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) < 0. For the subsequent constructions we need only the following assumption about the firing mechanism: For any time interval I of length greater than 0, the probability that v fires during I is arbitrarily close to 1 if P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) is sufficiently large for t ∈ I (up to the time when v fires), and the probability that v fires during I is arbitrarily close to 0 if P_v^{noisy}(t) − Θ_v^{noisy}(t − t′) is sufficiently negative for all t ∈ I.

It turns out that it suffices to assume only the following rather weak properties of the response functions ε_{u,v}: Each response function ε_{u,v}: R⁺ → R is either excitatory or inhibitory. All excitatory response functions ε_{u,v}(t) have the value 0 for t ∈ [0, d_{u,v}] and the value t − d_{u,v} for t ∈ [d_{u,v}, d_{u,v} + Δ], where d_{u,v} ≥ 0 is some fixed delay and Δ > 0 is some other constant. Furthermore we assume that ε_{u,v}(t) ≥ ε_{u,v}(d_{u,v} + Δ) for all t ∈ [d_{u,v} + Δ, d_{u,v} + Δ + γ], where γ with 0 < γ ≤ Δ/2 is another constant. With regard to inhibitory response functions ε_{u,v}(t), we assume that ε_{u,v}(t) = 0 for t ∈ [0, d_{u,v}] and ε_{u,v}(t) = −(t − d_{u,v}) for t ∈ [d_{u,v}, d_{u,v} + Δ]. Furthermore we assume that ε_{u,v}(t) = 0 for all sufficiently large t.

Finally we need a mechanism for increasing the firing threshold Θ := Θ_v(0) of a "rested" neuron v (at least for a short period). One biologically plausible assumption that would account for such an increase is that neuron v receives a large number of IPSPs from randomly firing neurons that arrive on synapses that are far away from the trigger zone of v, so that each of them has barely any effect on the dynamics of the potential at the trigger zone, but together they contribute a rather steady negative summand BN⁻ to the potential at the trigger zone. Other possible explanations for the increase of the firing threshold Θ could be based on the contribution of inhibitory interneurons whose IPSPs arrive close to the soma and are time locked to the onset of the stimulus, or on long-lasting inhibitions such as those mediated by GABA_B receptors. Formally we assume that each neuron v receives some negative (i.e., inhibitory) potential BN⁻ < 0 that can be assumed to be constant during the time intervals that are considered in the following arguments.

In comparison with other models for spiking neurons, this model allows more general noise than the models considered in Gerstner and van Hemmen (1994) and Maass (1995). On the other hand this model is somewhat less general than the one considered in Maass (1996a).
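The following minimal sketch puts the pieces of this formal model together (all numeric parameters are illustrative choices of ours, not values from the text): a piecewise-linear response function as assumed above, the potential P_v(t) as a weighted sum of response functions, and a noise-free firing time found by scanning for the first threshold crossing.

def eps(t, d=1.0, delta=2.0, excitatory=True):
    # Response function: 0 before the delay d, then linear with slope +1
    # (EPSP) or -1 (IPSP) on [d, d + delta]; held constant afterward,
    # a simplification of the shapes in Figure 1.
    sign = 1.0 if excitatory else -1.0
    return 0.0 if t < d else sign * min(t - d, delta)

def firing_time(spike_times, weights, theta, bn=0.0, dt=0.001, t_max=20.0):
    # First time t with P_v(t) = sum_i w_i * eps(t - t_i) + BN- >= theta.
    t = 0.0
    while t < t_max:
        p = bn + sum(w * eps(t - ti) for w, ti in zip(weights, spike_times))
        if p >= theta:
            return t
        t += dt
    return None

print(firing_time(spike_times=[0.0, 0.3], weights=[1.0, 1.0], theta=2.0))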
Having defined the formal model, we can now explain the key mechanism of the constructions in more detail. It is well known that incoming EPSPs and IPSPs are able to shift the firing time of a biological neuron. We explore this effect in the mathematical model of a spiking neuron, showing that in principle it can be used to carry out complex analog computations in temporal coding. Assume that a spiking neuron v receives PSPs from presynaptic neurons a_1, ..., a_n, that w_i is the weight (efficacy) for the synapse from a_i to v, and that d_i is the time delay from a_i to v. Then there exists a range of values for the parameters where the firing time t_v of neuron v can be written in terms of the firing times t_{a_i} of the presynaptic neurons a_i as

t_v = (Θ − BN⁻) / Σ_{i=1}^n w_i + Σ_{i=1}^n w_i · (t_{a_i} + d_i) / Σ_{i=1}^n w_i.   (1.1)
Hence in principle a spiking neuron is able to compute in temporal coding of inputs and outputs a linear function (where the efficacies of synapses encode the coefficients of the linear function, as in rate coding of analog variables). The calculations at the beginning of Section 2 show that this holds precisely if there is no noise and the PSPs are at time t_v all in their initial linearly rising or linearly decreasing phase. However, for a biological interpretation, it is interesting to know that even if the firing times t_{a_i} (or more precisely their effective values t_{a_i} + d_i) lie further apart, this mechanism computes a meaningful approximation to a linear function. It employs (through the natural shape of PSPs) an interesting adaptation of outliers among the t_{a_i} + d_i: Input neurons a_i that fire too late (relative to the average) lose their influence on the determination of t_v, and input neurons a_i that fire extremely early have the same impact as neurons a_i that fire somewhat later (but still before the average). Remark 2 in Section 2 provides a more detailed discussion of this effect.

The goal of the next section is to prove a rigorous theoretical result about the computational power of formal models for networks of spiking neurons. We are not claiming that this construction (which is designed exclusively for that purpose) provides a blueprint for the organization of fast analog computations in biological neural systems. However, it provides the first theoretical model that is able to explain the possibility of fast analog computations with noisy spiking neurons. Some remarks about the possible biological relevance of details of this construction can be found after the theorem in Section 2.
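As a quick numeric check of equation 1.1 (a toy example with parameters of our own choosing, not taken from the text): with all PSPs in their linearly rising phase, the potential is P_v(t) = Σ_i w_i · (t − (t_{a_i} + d_i)) + BN⁻, and solving P_v(t_v) = Θ gives the weighted-average form above.

w = [1.0, 2.0, 1.0]          # synaptic efficacies w_i
t_a = [0.0, 0.2, 0.4]        # presynaptic firing times t_ai
d = [1.0, 1.0, 1.2]          # delays d_i
theta, bn = 3.0, -0.5        # threshold Theta and background potential BN-

wsum = sum(w)
t_v = (theta - bn) / wsum + sum(
    wi * (ti + di) for wi, ti, di in zip(w, t_a, d)) / wsum

# Direct check: the potential at t_v equals the threshold. Here each offset
# t_v - (t_ai + d_i) stays in [0, 2], so a rising segment of length Delta = 2
# keeps the linearity assumption valid.
p = bn + sum(wi * (t_v - (ti + di)) for wi, ti, di in zip(w, t_a, d))
assert abs(p - theta) < 1e-9
print(t_v)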
2 The Main Construction
Consider an arbitrary π_γ-gate G, for some γ > 0, which computes a function f_G: [0, γ]^n → [0, γ]. Let r_1, ..., r_n ∈ R be the weights of G. Thus we have, for arbitrary inputs s_1, ..., s_n ∈ [0, γ],

f_G(s_1, ..., s_n) = 0 if Σ_{i=1}^n r_i · s_i < 0,
f_G(s_1, ..., s_n) = Σ_{i=1}^n r_i · s_i if 0 ≤ Σ_{i=1}^n r_i · s_i ≤ γ,
f_G(s_1, ..., s_n) = γ if Σ_{i=1}^n r_i · s_i > γ.
Figure 2: The simulation of a sigmoidal gate by a spiking neuron v in temporal coding.
For the sake of simplicity we first consider the case of spiking neurons without noise (i.e., α_v ≡ β_v ≡ 0, and each neuron v fires whenever P_v(t) crosses Θ_v(t − t′) from below). Then we describe the changes that are needed in this construction for the general case of noisy spiking neurons.

We construct for a given π_γ-gate G and for an arbitrary given parameter ε > 0 with ε < γ a network N_{G,ε} of spiking neurons that approximates f_G with precision ≤ ε; that is, the output N_{G,ε}(s_1, ..., s_n) of N_{G,ε} satisfies |N_{G,ε}(s_1, ..., s_n) − f_G(s_1, ..., s_n)| ≤ ε for all s_1, ..., s_n ∈ [0, γ]. In order to be able to scale the size of weights according to the given gate G, we assume that N_{G,ε} receives an additional input s_0 that is given like the other input variables s_1, ..., s_n in temporal coding. Thus we assume that there are n + 1 input neurons a_0, ..., a_n with the property that a_i fires at time T_in − s_i (where T_in is some constant). We will discuss at the end of this section (in Remarks 5 and 6) biologically more plausible variations of the construction where a_0 and T_in are not needed.

We construct a spiking neuron v in N_{G,ε} that receives n + 1 PSPs h_0(t), ..., h_n(t) from the n + 1 input neurons a_0, ..., a_n, which result from the firing of a_i at time T_in − s_i (see Figure 2). In addition v receives some auxiliary PSPs from other spiking neurons in N_{G,ε} whose timing depends only on T_in.
The firing time t_v of this neuron v will provide the output N_{G,ε}(s_1, ..., s_n) of the network N_{G,ε} in temporal coding; that is, v will fire at time T_out − N_{G,ε}(s_1, ..., s_n) for some T_out that does not depend on s_1, ..., s_n.

Let w_{a_i,v} be the weight of the synapse from input neuron a_i to neuron v, i = 0, ..., n. We assume that the "delay" d_{a_i,v} between a_i and v is the same for all input neurons a_0, ..., a_n, and we write d for this common delay. Thus we can describe for i = 0, ..., n the impact of the firing of a_i at time T_in − s_i on the potential at the trigger zone of neuron v at time t by the EPSP or IPSP h_i(t) = w_{a_i,v} · ε_{a_i,v}(t − (T_in − s_i)), which has on the basis of our assumptions the value

h_i(t) = 0 if t − (T_in − s_i) < d,   h_i(t) = w_i · (t − (T_in − s_i) − d) if d ≤ t − (T_in − s_i) ≤ d + Δ,

where w_i = w_{a_i,v} in the case of an EPSP and w_i = −w_{a_i,v} in the case of an IPSP. We assume that neuron v has not fired for a sufficiently long time, so that its threshold function Θ_v(t − t′) can be assumed to have a constant value Θ when the PSPs h_0(t), ..., h_n(t) arrive at the trigger zone of v. Furthermore we assume for the moment that besides these n + 1 PSPs only BN⁻ influences the potential at the trigger zone of v. This contribution BN⁻ is assumed to have a constant value in the time interval considered here. Then if no noise is present, the time t_v of the next firing of v can be described by the equality

Θ = Σ_{i=0}^n h_i(t_v) + BN⁻ = Σ_{i=0}^n w_i · (t_v − (T_in − s_i) − d) + BN⁻,   (2.1)

provided that

d ≤ t_v − (T_in − s_i) ≤ d + Δ for i = 0, ..., n.   (2.2)

We assume from now on that a fixed value s_0 = 0 is chosen for the extra input s_0. Then equation 2.1 is equivalent to

t_v = T_in + d + (Θ − BN⁻ − Σ_{i=1}^n w_i · s_i) / Σ_{i=0}^n w_i.   (2.3)

This t_v satisfies equation 2.2 if −s_j ≤ t_v − T_in − d ≤ Δ − s_j for j = 0, ..., n; hence for any s_j ∈ [0, γ] if

0 ≤ (Θ − BN⁻ − Σ_{i=1}^n w_i · s_i) / Σ_{i=0}^n w_i ≤ Δ − γ.   (2.4)
We set w_i := λ · r_i for i = 1, ..., n, where r_1, ..., r_n are the weights of the simulated π_γ-gate G, and λ > 0 is some not-yet-determined factor (that we
will later choose sufficiently large in order to make sure that neuron v fires close to the time t_v given by equation 2.3 even in the presence of noise). We choose w_0 so that Σ_{i=0}^n w_i = λ. This implies that equation 2.3 with

T_out := (Θ − BN⁻)/λ + T_in + d

is now equivalent to

t_v = T_out − Σ_{i=1}^n r_i · s_i,   (2.5)

and equation 2.4 is equivalent to

0 ≤ (Θ − BN⁻)/λ − Σ_{i=1}^n r_i · s_i ≤ Δ − γ.   (2.6)

Hence, provided that γ, Δ, and BN⁻ are chosen in relationship to Θ and λ so that

γ ≤ (Θ − BN⁻)/λ ≤ Δ − γ,   (2.7)

we have satisfied equation 2.4, and therefore achieved that the firing time t_v of neuron v provides in temporal coding the output f_G(s_1, ..., s_n) = Σ_{i=1}^n r_i · s_i of the simulated π_γ-gate G for all inputs s_1, ..., s_n ∈ [0, γ] with Σ_{i=1}^n r_i · s_i ∈ [0, γ].
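A small numeric sketch of this linear regime (all parameter values below are our own illustrative choices): with w_i = λ·r_i, s_0 = 0, and w_0 chosen so that the weights sum to λ, the firing time satisfies equation 2.5.

r = [0.3, 0.5, 0.2]              # weights of the simulated gate G
s = [0.4, 0.1, 0.6]              # inputs in [0, gamma]
gamma, delta = 1.0, 4.0          # coding range and rising-segment length
lam, theta, t_in, d = 5.0, 12.0, 0.0, 1.0
bn = theta - lam * 2.0           # pick BN- so (theta - bn)/lam = 2 lies in [gamma, delta - gamma]

w = [lam * ri for ri in r]
w0 = lam - sum(w)                # weight of the auxiliary input a_0 (s_0 = 0)
t_out = (theta - bn) / lam + t_in + d

# Equation 2.5: the firing time encodes the weighted sum in temporal coding.
t_v = t_out - sum(ri * si for ri, si in zip(r, s))

# Check against equation 2.1 (with s_0 = 0 prepended):
weights, inputs = [w0] + w, [0.0] + s
p = bn + sum(wi * (t_v - (t_in - si) - d) for wi, si in zip(weights, inputs))
assert abs(p - theta) < 1e-9
print("output in temporal coding:", t_out - t_v)   # equals sum_i r_i * s_i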
In order to simulate G also for s_1, ..., s_n ∈ [0, γ] with other values of Σ_{i=1}^n r_i · s_i, we make sure that v fires at a time with distance at most ε to T_out − γ if Σ_{i=1}^n r_i · s_i is larger than γ, and that v fires with distance at most ε to time T_out if Σ_{i=1}^n r_i · s_i < 0. For that purpose, we add inhibitory and excitatory neurons that fire at times that depend on time T_in but not on the inputs s_1, ..., s_n. The activity of these auxiliary inhibitory and excitatory neurons may also shift the firing time t_v of v if Σ_{i=1}^n r_i · s_i ∈ [0, γ], but at most by ε.

According to the previously described construction, we have by equation 2.5 that t_v = T_out if s_i = 0 for i = 1, ..., n (and therefore Σ_{i=1}^n r_i · s_i = 0). Furthermore the parameters have been chosen so that equation 2.2 is satisfied for this case, which implies that each of the PSPs h_i(t) is at time T_out for s_i = 0 still within the initial segment of length Δ of its first nonzero segment. This implies that for any value of s_i ∈ [0, γ] the PSP h_i(t) is at time T_out − γ not further advanced than the end of the initial segment of length Δ of its
first nonzero segment; hence

|h_i(t)| ≤ |w_i| · Δ

for all t ≤ T_out − γ and all s_i ∈ [0, γ]. This implies that for any s_1, ..., s_n ∈ [0, γ] (even if Σ_{i=1}^n r_i · s_i ∉ [0, γ]) we have |Σ_{i=0}^n h_i(t)| ≤ W · Δ for all t ≤ T_out − γ, where W := Σ_{i=0}^n |w_i|. Hence in order to prevent a firing of v before time T_out − γ, it suffices to employ auxiliary neurons in N_{G,ε} that send IPSPs to v that lower the potential at the trigger zone of v during the interval [T_out − γ − Δ, T_out − γ] by more than W · Δ + BN⁻ − Θ. In general, these IPSPs also influence the potential at the trigger zone of v shortly after time T_out − γ. Since these auxiliary IPSPs are independent of the input s_1, ..., s_n, they may delay the firing of v even when Σ_{i=1}^n r_i · s_i ∈ [0, γ]. In order to approximate the given π_γ-gate G with an error of at most ε, we assume that these auxiliary IPSPs have vanished at the trigger zone of v by time T_out − γ + ε. It is obvious that all these conditions can be achieved with a sufficient number (and/or weights) of synchronized auxiliary IPSPs from inhibitory neurons in N_{G,ε} whose firing time depends only on T_in. This can be satisfied using only the very weak condition that each IPSP is continuous and of value 0 before it eventually vanishes (see the precise condition in Section 1).

We now want to make sure that v fires at the latest by time T_out − γ + ε if Σ_{i=1}^n r_i · s_i > γ − ε. Since all auxiliary IPSPs have vanished by time T_out − γ + ε, we have P_v(T_out − γ + ε) = Σ_{i=0}^n h_i(T_out − γ + ε) + BN⁻. Hence it suffices to show that the latter is ≥ Θ. Consider the set I of those i ∈ {1, ..., n} with r_i > 0, that is, those i where w_{a_i,v} · ε_{a_i,v}(t) represents an EPSP. By choosing suitable values s′_i in the interval [0, s_i] for i ∈ I and by setting s′_i := s_i for i ∈ {1, ..., n} − I, one can achieve that Σ_{i=1}^n r_i · s′_i = γ − ε. According to equation 2.5, the potential P_v reaches the value Θ at time T_out − γ + ε for the input (s′_1, ..., s′_n), and according to equations 2.2, 2.4, 2.6, and 2.7, each PSP w_{a_i,v} · ε_{a_i,v} is at time T_out − γ + ε still within Δ of the beginning of its nonzero phase. If we now change s′_i to s_i for i ∈ I, the EPSP w_{a_i,v} · ε_{a_i,v} will be advanced in time by s_i − s′_i ∈ [0, γ]. Hence each of the EPSPs w_{a_i,v} · ε_{a_i,v} for i ∈ I is for input s_i at time T_out − γ + ε within Δ + γ of the beginning of its rising phase. Since we have assumed in Section 1 that w_{a_i,v} · ε_{a_i,v}(t) ≥ w_{a_i,v} · ε_{a_i,v}(d + Δ) for all t ∈ [d + Δ, d + Δ + γ], the potential P_v(T_out − γ + ε) has for input (s_1, ..., s_n) a value that is at least as large as for input (s′_1, ..., s′_n), and therefore a value ≥ Θ. This implies that in the case Σ_{i=1}^n r_i · s_i > γ − ε the neuron v will fire within the time interval [T_out − γ, T_out − γ + ε].

In an analogous manner we can achieve with the help of EPSPs from auxiliary excitatory neurons in N_{G,ε} (whose firing time depends only on T_in, not on the input values s_1, ..., s_n) that neuron v fires at the latest at time T_out, even if Σ_{i=1}^n r_i · s_i < 0. The preceding analysis implies that for any s_i ∈ [0, γ] and any t ≤ T_out, the absolute value of the PSP w_{a_i,v} · ε_{a_i,v}(t) can be bounded by |w_{a_i,v}| · ρ, where ρ := sup{|ε_{a_i,v}(t)|: i ∈ {0, ..., n} and t ∈ [d, d + Δ + γ]} is the maximum absolute value that any ε_{a_i,v}(t) for i = 0, ..., n can reach during the initial segment of length Δ + γ of its nonzero segment. Thus if the absolute value of these functions ε_{a_i,v}(t) grows during [d + Δ, d + Δ + γ] not faster than during their linear segment [d, d + Δ] (which
Figure 3: The resulting activation function σ of the implementation of a sigmoidal gate in temporal coding.
apparently holds for biological PSPs), we can set ρ := Δ + γ. Consequently we can derive for any t ≤ T_out and any values s_1, ..., s_n ∈ [0, γ] the bound |Σ_{i=0}^n h_i(t)| ≤ Σ_{i=0}^n |w_{a_i,v}| · ρ = W · ρ. Thus it suffices to make sure that EPSPs from auxiliary neurons in N_{G,ε} reach the trigger zone of neuron v shortly after time T_out − ε and that their sum reaches a value Θ − BN⁻ + W · ρ by time T_out. Then for any values of s_1, ..., s_n ∈ [0, γ] the potential P_v(t) will reach the value Θ at the latest by time T_out, causing a firing of v by time T_out. These auxiliary EPSPs will in general have the side effect that they slightly advance the firing time t_v for inputs s_1, ..., s_n ∈ [0, γ] with Σ_{i=1}^n r_i · s_i ∈ [0, ε], but the firing time will stay within the interval [T_out − ε, T_out].

According to our assumption in Section 1, inf{Θ_v(x): x ∈ (0, γ]} is so large that v cannot fire more than once during [T_out − γ, T_out]. Hence our construction makes sure that v fires exactly once during [T_out − γ, T_out] for any inputs s_1, ..., s_n ∈ [0, γ]. In the following discussion we will denote this firing time t_v of v by T_out − N_{G,ε}(s_1, ..., s_n).

We have now shown that under the assumption that no noise influences the firing time t_v of neuron v, we have |N_{G,ε}(s_1, ..., s_n) − f_G(s_1, ..., s_n)| ≤ ε for all s_1, ..., s_n ∈ [0, γ]. If one plots the variable y = N_{G,ε}(s_1, ..., s_n) that neuron v outputs in temporal coding as a function of Σ_{i=1}^n r_i · s_i, one sees that the effective "activation function" σ of this implementation of the π_γ-gate G has the form indicated in Figure 3. Like the piecewise linear activation function π_γ, …
v will be arbitrarily close to the previously considered deterministic firing time t_v of neuron v, with probability arbitrarily close to 1.

Assume that some arbitrary ε, δ > 0 are given. In the preceding construction, we had chosen values w_i := λ · r_i for the strengths of the synapses to neuron v, where λ > 0 was some not-yet-determined factor. According to equation 2.3, a change in the value of this parameter λ can result only in a shift of t_v by an amount that is independent from the input variables s_1, ..., s_n. Furthermore, if one chooses the contribution BN⁻ of inhibitory background noise as a function of λ so that Θ − BN⁻ = λ · c for some constant c, the resulting firing time t_v is completely independent of the choice of λ. On the other hand, the parameter λ occurs as a factor in the nonconstant term Σ_{i=0}^n h_i(t) = Σ_{i=0}^n λ · r_i · ε_{a_i,v}(t − (T_in − s_i)) of the potential P_v(t) = Σ_{i=0}^n h_i(t) + BN⁻ (we ignore the effect of auxiliary PSPs for the moment). Hence by choosing λ sufficiently large, one can make sure that in the deterministic case P_v(t) has an arbitrarily large derivative at the time t_v when it crosses Θ_v(t − t′). Furthermore, if we choose γ, Δ, and BN⁻ so that equation 2.7 can be replaced by

2γ ≤ (Θ − BN⁻)/λ ≤ Δ − γ − ε,   (2.8)

then it is guaranteed that the linearly increasing potential P_v(t) (with slope proportional to λ) will rise with the same slope throughout the interval [t_v − γ, t_v + ε].

If we now keep this setting of the parameters but replace the deterministic neuron v by a stochastic neuron v that is subject to the two types of noise that were specified in Section 1, the following can be observed: If |α(t)| ≤ ᾱ and |β(t)| ≤ β̄ for all t, then the time interval around t_v during which P_v(t) is within the interval [Θ − ᾱ − β̄, Θ + ᾱ + β̄] becomes arbitrarily small for sufficiently large λ. Furthermore, if λ is sufficiently large (and BN⁻ is adjusted along with λ so that (Θ − BN⁻)/λ = c for some constant c > 0 that is independent of λ), then during [T_out − γ, t_v − ε] the term P_v(t) + ᾱ + β̄ − Θ is arbitrarily negative. Thus the probability that v fires during [T_out − γ, t_v − ε] can be brought arbitrarily close to 0. In addition, by making λ sufficiently large, one can achieve that P_v(t) − ᾱ − β̄ − Θ is arbitrarily large throughout the time interval [t_v + ε/2, t_v + ε]. Hence the probability that v fires by time t_v + ε can be brought arbitrarily close to 1.

If one increases the number or the weights of the previously described auxiliary PSPs in an analogous manner, one can achieve for the noisy neuron model that with probability ≥ 1 − δ the neuron v of network N_{G,ε,δ} fires exactly once during [T_out − γ, T_out], and at a time t̃_v = T_out − N_{G,ε,δ}(s_1, ..., s_n) with |N_{G,ε,δ}(s_1, ..., s_n) − f_G(s_1, ..., s_n)| ≤ 2ε, for all s_1, ..., s_n ∈ [0, γ].
The previously described construction can easily be adapted to allow shifts in the arrival times of auxiliary PSPs at v of up to ε/2. Hence we can allow that these PSPs also come from noisy spiking neurons.

In order to simulate an arbitrary given feedforward network N of π_γ-gates with precision ε and probability ≥ 1 − δ of correctness, one applies the preceding construction separately to each gate G in N. For any given ε, δ > 0, one determines for each π_γ-gate G in N (starting at the output gates of N) suitable values ε_G, δ_G > 0 so that, for the resulting network to approximate N within ε with probability ≥ 1 − δ, it suffices that each gate G in N is approximated within ε_G with probability ≥ 1 − δ_G by a network N_{G,ε_G,δ_G} of noisy spiking neurons. In this way, one can achieve that the network N_{N,ε,δ} of noisy spiking neurons that is composed of these networks N_{G,ε_G,δ_G} approximates in temporal coding with probability ≥ 1 − δ the output of N within ε, for any given network input x_1, ..., x_m ∈ [0, γ]. Actually it would be more efficient if the modules N_{G,ε_G,δ_G} are not disjoint but share the neurons that generate the auxiliary EPSPs and IPSPs. Note that the problem of achieving reliable digital and analog computation with noisy neurons has already been considered in von Neumann (1956) for other types of formal neuron models. Thus we have shown:
Theorem. For any given ε, δ > 0 one can simulate any given feedforward sigmoidal neural net N consisting of π_γ-gates (for some sufficiently small γ that depends on the chosen model for a spiking neuron) by a network N_{N,ε,δ} of noisy spiking neurons in temporal coding. More precisely, for any network input x_1, ..., x_m ∈ [0, γ] the output of N_{N,ε,δ} differs with probability ≥ 1 − δ by at most ε from that of N. Furthermore the computation time of N_{N,ε,δ} depends on neither the number of gates in N nor the parameters ε, δ, but only on the number of layers of the sigmoidal neural network N.
Remark 1. One can exploit the concrete shape of PSPs in biological neurons in order to arrive at an alternative approximation of sigmoidal neural nets by spiking neurons in temporal coding, without simulating explicitly the activation function π_γ of the sigmoidal neural net. In this alternative approach, one can delete from the previously described construction the auxiliary neurons that prevent a firing of neuron v before time T_out − γ and force it to fire at the latest by time T_out. Furthermore in this interpretation we no longer have to assume that the initial linear segments of all incoming PSPs overlap, and hence less precision in the firing times is needed. In case v fires substantially after time T_out, the resulting PSP at a postsynaptic neuron v′ (which simulates a sigmoidal gate on the next layer of the simulated sigmoidal neural net) still has its initial value 0 at the time t_{v′} when v′ fires. Dually, if v fires before time T_out − γ, the resulting PSP may be at time t_{v′} already near its saturation value (where it increases or decreases more slowly than during its initial "linear phase"). Thus in either case, the concrete functional form of EPSPs and IPSPs in a biological neuron v′ mod-
ulates the input from a presynaptic neuron v in a way that corresponds to the application of a saturating sigmoidal activation function to the output of v in temporal coding. Furthermore, larger differences than γ between the firing times of presynaptic neurons can be tolerated in this context.

This implicit implementation of an activation function is, however, mathematically imprecise. A closer look shows that the amount by which the output of v is adjusted in temporal coding depends on the size of the output of v relative to the size of the outputs of the other presynaptic neurons of v′ (in temporal coding). The output of v is changed by a larger amount if it differs more strongly from the median output of the other presynaptic neurons of v′ (but it is always moved in the direction of the median). This context-dependent modulation of the output of v is hard to exploit for precise theoretical results, but it appears to be useful for practically relevant computations such as pattern recognition. Thus we arrive in this way at a variation of the implementation of a sigmoidal neural net, where two biologically dubious components of our main construction (the auxiliary neurons that force v to fire exactly once during [T_out − γ, T_out], as well as the requirement that the initial linear segments of all relevant PSPs have to overlap) are deleted, but the resulting network of spiking neurons can still carry out complex and practical multilayer computations in temporal coding.
Remark 2. Our construction shows as a special case that a linear function of the form w · s for real-valued inputs s = (s_1, ..., s_n) and a stored vector w = (w_1, ..., w_n) can be computed very efficiently (and very quickly) by a spiking neuron v in temporal coding. In this case, no auxiliary neurons are needed. Quick computation of linear functions is relevant in many biological contexts, such as coordinate transformations between different frames of reference or the analysis of a complex stimulus s in terms of many stored patterns w, w′, .... For example, in an olfactory neural system (see, e.g., Hopfield 1991, 1995) the stimulus s may be thought of as a superposition of various stored basic odors w, w′, .... In this case the output y = w · s of neuron v in temporal coding may be interpreted as the amount by which the basic odor w (which is stored in the efficacies of the synapses of v) is present in the stimulus s. Furthermore another neuron v̂ on the next layer might receive as its input y = (y, y′, ...) from several such neurons v, v′, ...; that is, v̂ receives the "mixing proportions" y = w · s, y′ = w′ · s for various stored basic odors w, w′, ... in temporal coding. This neuron v̂ on the second layer can then continue the pattern analysis by computing for its input y the inner product ŵ · y with some stored "higher-order pattern" ŵ (e.g., the composition of basic odors characteristic for an individual animal). Such multilayer pattern analysis is facilitated by the fact that the neurons considered here encode their output in the same way in which their input is encoded (in contrast to the approach in Hopfield 1995). One also gets in this way a very fast implementation of Linsker's network (Linsker 1988) with spiking neurons in temporal coding.
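The multilayer pattern analysis described here can be summarized in a few lines (a toy sketch with stimuli and stored patterns of our own invention; the temporal mechanics of equation 2.5 are abstracted away, and only the values carried by the firing times are shown):

import numpy as np

s = np.array([0.2, 0.7, 0.1])            # stimulus
W = np.array([[0.6, 0.3, 0.1],           # stored "basic odor" patterns w, w'
              [0.1, 0.2, 0.7]])
y = W @ s                                # layer 1: mixing proportions w.s, w'.s
w_hat = np.array([0.5, 0.5])             # stored higher-order pattern
print(y, w_hat @ y)                      # layer 2: inner product with y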
Remark 3. For biological neurons it is impossible to choose the parameter λ arbitrarily large. On the other hand, the experiments of Bryant and Segundo (1976), as well as the recent experiments of Mainen and Sejnowski (1995), suggest that biological neurons already exhibit very high precision in their firing times if the slope of the membrane potential P_v(t) at the time t when it crosses the firing threshold is moderately large.

Remark 4. Analog computations with the type of temporal coding considered here become impossible if the jitter in the firing times is large relative to the size of the range [0, γ] of analog variables in temporal coding. The following points should be taken into account in this context:
• Even if the jitter is so high that in temporal coding just two different outputs can be distinguished in a reliable manner, the computational power of the constructed network N of spiking neurons is still enormous from the point of view of computational complexity theory. In this case, the network can simulate arbitrary threshold circuits (i.e., multilayer perceptrons whose gates give binary outputs) very fast. Threshold circuits are extremely powerful models for parallel digital computation, which can (in contrast to PRAMs and other models for currently available computer hardware) compute various nontrivial boolean functions that depend on a large number n of input bits with polynomially in n many gates and not more than four layers (Johnson 1990; Roychowdhury et al. 1994; Siu et al. 1995).
• Formally the value of γ in the preceding construction was required to be very small: it was bounded by a fraction of the length Δ of the rising segment of an EPSP (see inequalities 2.7 and 2.8). However, keep in mind that we were forced to resort to such small values for γ only because we wanted to prove a rigorous theoretical result. The considerations in Remark 1 suggest that in a practical context, one can still carry out meaningful computations if the linearly rising segments of incoming EPSPs are spread out over a somewhat larger time interval than Δ, where they need no longer overlap. If one wants to compute specific values for the parameters γ and Δ, one runs into the problem that the value of Δ varies enormously among different biological neural systems, from about 1 to 3 ms for EPSPs resulting from AMPA receptors in cortex to about a second in dark-adapted toad photoreceptors.
• There exists an alternative interpretation of our construction in a biological neural system that focuses on a smaller spatial scale. In this interpretation, a "hot spot" in the dendritic tree of a biological neuron
assumes the role of a spiking neuron in our construction. Hot spots are patches of membrane with voltage-dependent channels that are known to occur at branching points of the dendritic tree (Shepherd 1994; Jaffe et al. 1992; Mel 1993; Softky 1994). They fire a dendritic spike if the membrane potential reaches a certain threshold value. From this point of view, a single biological neuron may be viewed as a network of spiking neurons, which, according to our construction, can simulate in temporal coding a multilayer sigmoidal neural net. The timing precision of such circuits is possibly very high since it does not involve synapses (see also Softky 1994). Furthermore in this interpretation the computation is less affected by possible failures of synapses.

• In another biological interpretation one can replace each neuron v in
N_{N,ε,δ} by a pool P_v of neurons (as in a synfire chain; see Abeles et al. 1993). The firing time t_v of neuron v is then replaced by the mean firing time of neurons in P_v, and an EPSP from v is replaced by a sum of EPSPs from neurons in P_v. This interpretation has the advantage that it is less affected by jitter in the firing times of individual neurons and by stochastic failures of individual synapses. Furthermore, the rising segment of a sum of EPSPs from P_v is likely to be substantially longer than that of an individual EPSP. Hence in this interpretation, the parameter γ can be chosen substantially larger than for single neurons.
Remark 5. The auxiliary neuron a_0 that provides in the construction a reference spike at time T_in is not really necessary. Without such an auxiliary neuron a_0, the weights w_i of the inputs s_i are automatically normalized so that they sum up to 1 (see equation 2.3 with Σ_{i=0}^n w_i replaced by Σ_{i=1}^n w_i). Such normalization is disadvantageous for simulating an arbitrary given sigmoidal gate G, but it may be desirable in a biological context.

Remark 6. Our construction was based on a specific temporal coding in terms of reference times T_in and T_out. Some biological systems may provide such reference times that are time locked with the onset of a sensory stimulus. However, our construction can also be adapted to other types of temporal coding that require no reference times. For example, the basic equation at the end of Section 1 (equation 1.1) also yields a mechanism for carrying out analog computations with regard to the scheme of "competitive temporal coding" discussed in Thorpe and Imbert (1989). In this coding scheme the firing time t_{a_i} of neuron a_i encodes the analog variable (t_{a_i} + d_i) − min_{j=1,...,n}(t_{a_j} + d_j), where d_i = d_{a_i,v} is the delay between neuron a_i and v. For T := min_{j=1,...,n}(t_{a_j} + d_j) we have, according to equation 1.1,

t_v = (Θ − BN⁻) / Σ_{i=1}^n w_i + T + Σ_{i=1}^n w_i · ((t_{a_i} + d_i) − T) / Σ_{i=1}^n w_i.   (2.9)
In this coding scheme the firing time t_v encodes the analog value t_v − min_{û∈L} t_û, where L is the layer of neurons to which v belongs. Thus, equation 2.9 provides the mechanism for computing a segment of a linear function for inputs and outputs in competitive temporal coding. Hence our construction also provides a method for simulating multilayer neural nets with regard to this alternative coding scheme. No reference times T_in or T_out are needed for that.
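A small sketch of the reference-free property behind equation 2.9 (toy numbers of our own choosing): because the closed-form firing time of equation 1.1 is an affine function of the input firing times whose coefficients sum to 1, shifting all presynaptic firing times by a common constant shifts t_v by the same constant, so values coded relative to the layer minimum are unchanged.

def t_v(t_a, d, w, theta, bn):
    # Closed-form firing time of equation 1.1 (all PSPs in their rising phase).
    wsum = sum(w)
    return (theta - bn) / wsum + sum(
        wi * (ti + di) for wi, ti, di in zip(w, t_a, d)) / wsum

t_a = [0.0, 0.2, 0.5]
d, w = [1.0, 1.0, 1.0], [1.0, 2.0, 1.0]
base = t_v(t_a, d, w, theta=3.0, bn=0.0)
shifted = t_v([t + 7.0 for t in t_a], d, w, theta=3.0, bn=0.0)
assert abs((shifted - base) - 7.0) < 1e-9   # output shifts with the inputs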
To simulate the bounded output range of a sigmoidal activation function by spiking neurons with competitive temporal coding, one can add lateral excitation among all neurons on the same layer L. In this way the neurons in L are forced to fire within a rather small time window (corresponding to the bounded output range of a sigmoidal activation function). In other applications, it appears to be more advantageous to employ instead lateral inhibition. This also has the effect of preventing the value of t_v − min_{û∈L} t_û from becoming too large.
Remark 7. For the sake of simplicity we have considered in the preceding construction only feedforward computations on networks of spiking neurons. However, the same simulation method can be used to simulate recurrent sigmoidal neural nets by recurrent networks of spiking neurons. For example, in this way one gets a novel implementation of a Hopfield net (with synchronous updates). At each "round" of the computation, the output of a unit of the Hopfield net is encoded by the firing time of a corresponding spiking neuron relative to that of other neurons in the network. For example, one may assume that a neuron gives the highest possible output value if it is among the first neurons that fire at that round. Each spiking neuron simulates in temporal coding a sigmoidal unit whose inputs are the firing times of the other spiking neurons in the network at the previous round. One can employ here competitive temporal coding (see Remark 6); hence no reference times or external clock are needed. A stable state of the Hopfield net is reached in this implementation (with competitive temporal coding) if and only if all neurons fire "regularly" (i.e., at regular intervals with a common interspike interval) but generally with different phases. These phases encode the output values of the individual neurons, and together these phase differences represent a "recalled" stored pattern of the Hopfield net. Thus, each stored pattern of the Hopfield net is realized by a different assignment of phase differences in some stable oscillation.

This novel implementation of a Hopfield net with spiking neurons in temporal coding has, apart from its high computation speed, another feature that is possibly of biological interest: whereas the input to such a network of spiking neurons can be transient (encoded in the relative timing of the firing of each neuron in the network at the first round), its output is available over
a longer time period, since it is encoded in the phases of the neurons in a stable global oscillation of the network. Hence, this implementation makes it possible that even in the rapidly fluctuating environment of temporal coding with single spikes, the outputs from different neural subsystems (which may operate at different time scales) can be collected and integrated by a larger neural system.
3 Universal Approximation Properties of Networks of Noisy Spiking Neurons in Temporal Coding
One of the most interesting and useful features of sigmoidal neural nets N is the fact that if [0, γ] is the range of their activation functions, they can approximate, for any given natural numbers n and k, any given continuous function F from [0, γ]^n into [0, γ]^k within any given ε > 0 (with regard to uniform convergence, i.e., the L_∞ norm). Furthermore it suffices to consider for this purpose feedforward nets N with just one hidden layer of neurons and (roughly) any activation function σ that is not a polynomial (Leshno et al. 1993). In addition, many years of experiments with backpropagation and other learning rules for sigmoidal neural nets have shown that for most concrete application problems, sigmoidal neural nets with relatively few hidden units allow a satisfactory approximation of the underlying target function F.

The result from the preceding section allows us to transfer these results to networks of spiking neurons with temporal coding. If some feedforward sigmoidal neural net N approximates an arbitrary given continuous function F: [0, γ]^n → [0, γ]^k within ε (with regard to the L_∞ norm), then with probability ≥ 1 − δ the network N_{N,ε,δ} of spiking neurons that we constructed in Section 2 approximates the same F within 2ε (with regard to the L_∞ norm). Furthermore, if N has only a small number p of layers, the computation time of N_{N,ε,δ} can be bounded (for biologically reasonable choices of the parameters involved) by 10 · p ms. Thus if one neglects the fact that the fan-in of biological neurons is bounded by some fixed (although rather large) constant, the preceding theoretical results suggest that networks of biological neurons can in principle approximate arbitrary continuous functions F: [0, γ]^n → [0, γ]^k within any given ε with a computation time of not more than 20 ms.

Finally, we would like to point out that our approximation result holds not only for the particular way of encoding analog inputs and outputs by firing times that was considered in the previous section, but basically for any coding that is continuously related to it. More precisely, let n, ñ, k, k̃ be arbitrary natural numbers. Let Q: [0, γ]^ñ → [0, 1]^n be any continuous function that specifies a method of "decoding" n variables ranging over [0, 1] from the firing times of ñ neurons during some time window of length γ (whose
end point is marked by the firing time of the first one of these ñ neurons), and let R: [0, 1]^k → [0, γ]^k̃ be any continuous and invertible function that describes a method for "encoding" k output variables ranging over [0, 1] by the firing times of k̃ neurons during some time window of length γ (whose end point is marked by the firing time of the first one of these k̃ neurons). Then for any given continuous function F̃: [0, 1]^n → [0, 1]^k, the composition F := R ∘ F̃ ∘ Q of these three functions is a continuous function from [0, γ]^ñ into [0, γ]^k̃. According to our preceding argument, there exists a network N_{F,ε,δ} of noisy spiking neurons (with one "hidden layer") such that for any x ∈ [0, γ]^ñ one has for the output N_{F,ε,δ}(x) of this network that ‖F(x) − N_{F,ε,δ}(x)‖ ≤ ε with probability ≥ 1 − δ, where ‖ · ‖ can be any common norm. Hence N_{F,ε,δ} approximates for arbitrary inputs the given function F̃: [0, 1]^n → [0, 1]^k, for arbitrarily chosen continuous functions R, Q for coding and decoding of analog variables by firing times of spiking neurons, with a precision of at least sup{‖R⁻¹(y) − R⁻¹(y′)‖ : y, y′ ∈ [0, γ]^k̃ and ‖y − y′‖ ≤ ε}. Thus, if the inverse R⁻¹ of the function R is uniformly continuous, one can approximate F̃ with regard to neural coding and decoding described by R and Q with arbitrarily high precision by networks of noisy spiking neurons with just one hidden layer.
"
.
.,.
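The composition F = R ∘ F̃ ∘ Q can be made concrete with placeholder choices (all three functions below are illustrative assumptions of ours, not the paper's): Q decodes firing times within a window of length γ into [0, 1] variables, F̃ is the target function, and R encodes the result back into firing-time offsets.

gamma = 1.0

def Q(firing_times):
    # Decode: offsets from the earliest spike, scaled into [0, 1].
    t0 = min(firing_times)
    return [(t - t0) / gamma for t in firing_times]

def F_tilde(x):
    # Target function on [0, 1]^n; here simply the mean of the inputs.
    return [sum(x) / len(x)]

def R(y):
    # Encode: a value in [0, 1] becomes a firing-time offset in [0, gamma].
    return [gamma * yi for yi in y]

def F(firing_times):
    return R(F_tilde(Q(firing_times)))

print(F([2.0, 2.3, 2.9]))   # inputs spread over less than gamma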
4 Consequences for Learning
In the traditional interpretation of (unsupervised) Hebbian learning, a synapse is strengthened if both the presynaptic and the postsynaptic neurons are simultaneously "active" (i.e., both give high output values in terms of their current firing rates). In the implementation of a sigmoidal neural net N by a network of spiking neurons N_{N,ε,δ} in Section 2, the "weights" r_i of N are in fact modeled by the "strengths" w_i of corresponding synapses between spiking neurons. However, the information whether both the presynaptic and postsynaptic neurons give high output values in temporal coding can no longer be read off from their "activity" but only from the time difference T_i between their firing times. This observation gives rise to the question of whether there are biological mechanisms known that support a modulation of the efficacy (i.e., "strength") w_i of a synapse as a function of this time difference T_i.

If one works in the linear range of the simulation N_{G,ε} of a π_γ-gate G according to Section 2 (where G computes the function (s_1, ..., s_n) ↦ Σ_{i=1}^n r_i · s_i and h_i(t) describes an EPSP), then for Hebbian learning it would be desirable to increase w_i = λ · r_i if s_i is close to Σ_{j=1}^n r_j · s_j; that is, if the difference in firing times T_i := t_v − (T_in − s_i) = (T_out − T_in) + s_i − Σ_{j=1}^n r_j · s_j is close to T_out − T_in. On the other hand, one would like to decrease w_i if T_i is substantially smaller
or larger than T_out − T_in. Hence, a Hebbian-style unsupervised learning rule that increases w_i whenever T_i is close to T_out − T_in, and decreases w_i whenever T_i is substantially smaller or larger (for suitable parameters that control the tolerance and the learning rate), would be meaningful in this context.
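The exact form of the rule aside, the following is a minimal sketch of the qualitative behavior just described (tolerance vartheta and learning rate mu are our own illustrative choices, not parameters from the text):

def hebb_update(w_i, T_i, T_ref, vartheta=0.1, mu=0.01):
    # Increase w_i when T_i is within vartheta of T_ref = T_out - T_in;
    # decrease it when T_i lies substantially outside that band.
    return w_i + mu if abs(T_i - T_ref) <= vartheta else w_i - mu

w = 0.5
w = hebb_update(w, T_i=1.02, T_ref=1.0)   # near coincidence: potentiate
w = hebb_update(w, T_i=1.60, T_ref=1.0)   # large mismatch: depress
print(w)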
Recent results from neurobiology (Stuart and Sakmann 1994) show that action potentials in neocortical pyramidal cells are actively (i.e., supported by voltage-dependent channels) propagated backward from the soma into the dendrites (see also Jaffe et al. 1992). Hence the time difference T_i between the firing of the presynaptic and the postsynaptic neurons is in principle available to each synapse. Furthermore new experimental results (Markram 1995; Markram and Sakmann 1995) show that in vitro, the efficacy of synapses of neocortical pyramidal neurons is in fact modulated as a function of this time difference T_i.

There exists one interesting structural difference between this interpretation of Hebbian learning in temporal coding and its traditional interpretation: the time difference T_i provides a synapse with the full information about the correlation between the output values of the pre- and postsynaptic neurons in temporal coding, no matter whether both neurons give high or low output values. However, in the traditional interpretation of Hebbian learning in terms of firing rates, the efficacy of a synapse is increased only if both neurons give high output values (in frequency coding).

An implementation of Hebbian learning in the temporal domain is also appealing in the context of pulse stream VLSI (i.e., "silicon spiking neurons"). These artificial neural nets are much faster than biological spiking neurons; they can work with interspike intervals in the microsecond range. If, for a hardware implementation of a sigmoidal gate with pulse stream VLSI according to the construction of Section 2, a Hebbian learning rule can be applied in the temporal domain after each pulse, such a chip may be able to carry out so many learning steps per second that it could in principle (neglecting input-output constraints) overcome the main impediment of traditional artificial neural nets: their low learning speeds.

So far we have assumed in the construction of N_{N,ε,δ} in Section 2 that the time delays d_i between the presynaptic neurons a_i and the postsynaptic neuron v (i.e., the time needed until an action potential from a_i can influence the potential at the trigger zone of v) were the same for all neurons a_i. Differences among these delays have the effect of providing an additive correction to the variables that are communicated in temporal coding. Hence, they also have the ability to give different "weights" to different input variables. In a biological context, they appear to be useful for providing to the network a priori information about its computational
task, so that Hebbian learning can be viewed as "fine-tuning" on top of this preprogrammed information. If there exist biological mechanisms for modulating such delays (see Hopfield 1995; Kempter et al. 1996), they would provide, in addition to a short-term memory via synaptic modulation, a separate mechanism for storing and adapting long-term memory via differences in the delays.

5 Conclusions
We have shown in this article that there exists a rather simple way to compute linear functions and to simulate feedforward as well as recurrent sigmoidal neural nets in temporal coding by networks of noisy spiking neurons. In contrast to the traditionally considered implementation via frequency coding, the new approach yields a computation speed that is faster by several orders of magnitude. In fact, to the best of our knowledge, it provides the first theoretical model that is able to explain the experimentally observed speed of fast information processing in the cortex on the basis of relatively slow spiking neurons as computational units.

Further experiments will be needed to determine whether this theoretical model is biologically relevant. One problem is that we do not know in which way batch inputs (consisting of many analog variables in parallel) are encoded by biological neural systems. The existing results on neural coding (see Rieke et al. 1996) address only the coding of time series, that is, sequential analog inputs. However, if further experiments showed that the input-dependent firing times in visual cortex, as reported in Bair et al. (1994), vary in a continuous (i.e., piecewise continuous) manner in response to smooth changes of complex inputs, this would provide some support to the style of theoretical models for analog computation with spiking neurons that is considered here.

Furthermore, a biological realization of recurrent neural nets (e.g., Hopfield nets) in temporal coding with spiking neurons, as proposed in Remark 7 in Section 2, would predict the occurrence of oscillations in neural systems (especially systems involved in pattern recognition and working memory), where the phase of individual neurons (or of pools of neurons) with regard to this oscillation is input dependent.

A noteworthy feature of the constructions presented here is that the "weights" (efficacies) of synapses are in principle able to play in temporal coding the same role as in the more familiar context of firing rate coding. Hence, all theories and experimental results regarding adaptation of neural circuits via synaptic plasticity can in principle also be applied to such computations in temporal coding. Furthermore, a closer look shows that the networks that we have constructed for analog computations in temporal coding can compute the same analog function on a different time scale in firing rate coding. This possible dual role of the circuits of spiking neurons constructed here is of interest in
the context of biology. There exists empirical evidence that some neural systems carry out, in addition to a very fast preliminary computation, a more thorough subsequent computation in terms of firing rates, which has the ability to integrate relevant contributions from several neural subsystems whose computational organization can be understood as an implementation
of view of computational complexity theory and engineering applications. In addition, feedforward and recurrent sigmoidal neural nets are the only parallel computational models known that support self-organization (and hence do not require cumbersome and error-prone central "programming"). Hence, if the "hardware" of biological neural systems allows in principle an efficient implementation of sigmoidal neural nets, it is not unreasonable to assume that this possibility has been realized by at least some biological neural systems (in spite of our lack of knowledge about the concrete functions that they compute). Apart from their biological interpretation, the constructions of this article appear to be of interest also in the context of hardware implementations of neural nets via pulse-stream VLSI (see, e.g., Pratt 1989; Horiuchi et al. 1991; Murray and Tarassenko 1994; Jahnke et al. 1995). A simulation of a feedforward sigmoidal neural net by pulse stream VLSI along the lines of this construction offers the possibility of combining extremely high computation speed (simulating one parallel computation step of the sigmoidal neural net per pulse) with a noise-robust and fast transmission of internal analog variables via the relative timing of pulses from different gates. We have shown in this article that networks of spiking neurons with temporal coding have at least the same computational power as sigmoidal neural nets of roughly the same size and depth. Together with a recent separation result (Maass 1996b), this implies that networks of spiking neurons have in fact strictly more computational power.

Acknowledgments

I thank Martin Arndt, David Haussler, Geoffrey Hinton, John Hopfield, Berthold Ruf, and Gordon Shepherd for stimulating discussions related to the topic of this article. I also thank the anonymous referees for their helpful comments.
References

Abeles, M., Bergman, H., Margalit, E., and Vaadia, E. 1993. Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology 70.
Bair, W., Koch, C., Newsome, W., and Britten, K. 1994. Reliable temporal modulation in cortical spike trains in the awake monkey. In Proc. Symposium on Dynamics of Neural Processing, Washington, DC.
Bialek, W., and Rieke, F. 1992. Reliability and information transmission in spiking neurons. Trends in Neuroscience 15, 428-434.
Bryant, H. L., and Segundo, J. P. 1976. J. Physiology 260, 279-314.
Ferster, D., and Spruston, N. 1995. Cracking the neuronal code. Science 270, 756-757.
Gerstner, W., and van Hemmen, J. L. 1994. How to describe neuronal activity: Spikes, rates, or assemblies? In Advances in Neural Information Processing Systems 6, pp. 463-470. Morgan Kaufmann, San Mateo, CA.
Heller, J., Hertz, J. A., Kjær, T. W., and Richmond, B. J. 1995. Information flow and temporal coding in primate pattern vision. J. Computational Neuroscience 2(3), 175-193.
Hopfield, J. J. 1991. Olfactory computation and object perception. Proc. Nat. Acad. Sci. USA 88, 6462-6466.
Hopfield, J. J. 1995. Pattern recognition computation using action potential timing for stimulus representations. Nature 376, 33-36.
Horiuchi, T., Lazzaro, J., Moore, A., and Koch, C. 1991. A delay-line based motion detection chip. In Advances in Neural Information Processing Systems 3, pp. 406-412. Morgan Kaufmann, San Mateo, CA.
Jaffe, D. B., Johnston, D., Lasser-Ross, N., Lisman, J. E., Miyakawa, H., and Ross, W. N. 1992. The spread of Na+ spikes determines the pattern of dendritic Ca2+ entry into hippocampal neurons. Nature 357, 244-246.
Jahnke, A., Roth, U., and Klar, H. 1995. Towards efficient hardware for spike-processing neural networks. In Proc. of the World Congress on Neural Networks, Washington, DC.
Johnson, D. S. 1990. A catalog of complexity classes. In Handbook of Theoretical Computer Science, J. v. Leeuwen, ed., vol. A, pp. 67-162. MIT Press, Cambridge, MA.
Kempter, R., Gerstner, W., van Hemmen, J. L., and Wagner, H. 1996. Temporal coding in the sub-millisecond range: Model of barn owl auditory pathway. In Advances in Neural Information Processing Systems 8, pp. 124-130. MIT Press, Cambridge, MA.
Leshno, M., Lin, V. Y., Pinkus, A., and Schocken, S. 1993. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks 6, 861-867.
Linsker, R. 1988. Self-organization in a perceptual network. Computer Magazine 21, 105-117.
Maass, W. 1995. On the computational power of networks of spiking neurons. In Advances in Neural Information Processing Systems 7, pp. 183-190. MIT Press, Cambridge, MA.
Maass, W. 1996a. On the computational power of noisy spiking neurons. In Advances in Neural Information Processing Systems 8, pp. 211-217. MIT Press, Cambridge, MA.
Maass, W. 1996b. Networks of spiking neurons: The third generation of neural network models. To appear in Neural Networks.
Mainen, Z. F., and Sejnowski, T. J. 1995. Reliability of spike timing in neocortical neurons. Science 268, 1503-1506.
Markram, H. 1995. Neocortical pyramidal neurons scale the efficacy of synaptic input according to arrival time: A proposed selection principle of the most appropriate synaptic information. In Proc. of Cortical Dynamics, pp. 10-11, Jerusalem.
Markram, H., and Sakmann, B. 1995. Action potentials propagating back into dendrites trigger changes in efficacy of single axon synapses between layer V pyramidal neurons. In Proc. of the Conference of the Society for Neuroscience.
Mel, B. W. 1993. Synaptic integration in an excitable dendritic tree. J. Neurophysiology 70, 1086-1101.
Murray, A., and Tarassenko, L. 1994. Analogue Neural VLSI: A Pulse Stream Approach. Chapman & Hall, London.
Perrett, D. I., Rolls, E. T., and Caan, W. C. 1982. Visual neurons responsive to faces in the monkey temporal cortex. Experimental Brain Research 47, 329-342.
Pratt, G. A. 1989. Pulse Computation. Ph.D. thesis, MIT.
Rieke, F., Warland, D., Ruyter van Steveninck, R., and Bialek, W. 1996. SPIKES: Exploring the Neural Code. MIT Press, Cambridge, MA.
Rolls, E. T. 1994. Brain mechanisms for invariant visual recognition and learning. Behavioural Processes 33, 113-138.
Rolls, E. T., and Tovee, M. J. 1994. Processing speed in the cerebral cortex, and the neurophysiology of visual backward masking. Proc. Roy. Soc. B 257, 9-15.
Roychowdhury, V., Siu, K., and Orlitsky, A., eds. 1994. Theoretical Advances in Neural Computation and Learning. Kluwer, Boston.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature 323, 533-536.
Segundo, J. P. 1994. Noise and the neurosciences: A long history, a recent revival, and some theory. In Origins: Brain and Self-Organization, K. H. Pribram, ed. Lawrence Erlbaum Associates, Hillsdale, NJ.
Sejnowski, T. J. 1995. Time for a new neural code? Nature 376, 21-22.
Shepherd, G. M. 1994. Neurobiology. 3d ed. Oxford University Press, New York.
Siu, K., Roychowdhury, V., and Kailath, T. 1995. Discrete Neural Computation. Prentice Hall, Englewood Cliffs, NJ.
Softky, W. 1994. Sub-millisecond coincidence detection in active dendritic trees. Neuroscience 58, 13-41.
Stuart, G. J., and Sakmann, B. 1994. Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature 367, 69-72.
Thorpe, S. J., and Imbert, M. 1989. Biological constraints on connectionist modelling. In Connectionism in Perspective, R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, and L. Steels, eds., pp. 63-92. Elsevier, New York.
von Neumann, J. 1956. Probabilistic logics and the synthesis of reliable organisms from unreliable components. In Automata Studies, C. E. Shannon and J. McCarthy, eds. Princeton University Press, Princeton, NJ.
Communicated by William Softky
Computing with the Leaky Integrate-and-Fire Neuron: Logarithmic Computation and Multiplication

Doron Tal
Eric L. Schwartz
Department of Cognitive and Neural Systems, Boston University, Boston, MA 02215 USA
The leaky integrate-and-fire (LIF) model of neuronal spiking (Stein 1967) provides an analytically tractable formalism of neuronal firing rate in terms of a neuron's membrane time constant, threshold, and refractory period. LIF neurons have mainly been used to model physiologically realistic spike trains, but little application of the LIF model appears to have been made in explicitly computational contexts. In this article, we show that the transfer function of a LIF neuron provides, over a wide parameter range, a compressive nonlinearity sufficiently close to that of the logarithm so that LIF neurons can be used to multiply neural signals by mere addition of their outputs, yielding the logarithm of the product. A simulation of the LIF multiplier shows that under a wide choice of parameters, a LIF neuron can log-multiply its inputs to within a 5% relative error.

1 Introduction

In his book on Mach bands, Ratliff (1965, p. 129) remarks that "often the relation between external stimulus and neural response is approximately logarithmic." As shown in Figure 1, such a relation is also observed in simulations based on the Hodgkin-Huxley equations (Hodgkin and Huxley 1952) where a step current of varying amplitudes is applied to a single neuron. A similar compressive response is seen, for example, in M-cells (of the magnocellular stream) of visual cortex. In psychophysics, logarithmic neural transduction has been hypothesized to account for Weber's law (Cornsweet 1970; Land 1977). In the neural modeling literature, a logarithmic transfer function is also commonly hypothesized; for example, Koch and Poggio (1992) describe a mechanism for multiplying with neurons whose transfer function is logarithmic, and Yeshurun and Schwartz (1989) show that an estimate of the power spectrum of the log power spectrum, called the cepstrum in signal processing, provides a simple model for estimating stereo disparity when applied to a columnar image data structure such as the ocular dominance column system of primate visual cortex. These references to logarithmic functionality in neurons lack a generic biophysical
Neural Computation 9, 305-318 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: (a) Current-frequency curve resulting from uniform depolarization of a membrane modeled by the Hodgkin-Huxley equations, using the giant squid axon parameters (Hodgkin and Huxley 1952). (b) Semilog plot of the curve in (a).
justification for this computation on the single cell level. In this article, the leaky integrate-and-fire (LIF) model is shown to provide such a justification. Specifically, we show that the ratio of refractory period (t0) duration to membrane time constant (τ) controls the degree of compressiveness of a neuron's transfer function: a cell with small t0/τ has a quasi-linear transfer function, larger t0/τ gives a quasi-logarithmic transfer function, and yet larger t0/τ values yield a transfer function more compressive than a logarithm. The article concludes with a discussion of precision and dynamic range of LIF and other single-cell computational models. Although the LIF model used in the simulations is a highly lumped model, its few parameters correspond to measurable physical quantities. In order to demonstrate how the LIF model's parameters affect its transfer function, the simplest (four-parameter) LIF model is used, ignoring spike-train variability and population codes. (See Softky and Koch 1993 and Softky 1995 on spike-train variability. Knight 1972 and Usher et al. 1993 discuss the improved temporal dynamics of population coding.) The work of Bugmann (1991) also demonstrates two computational regimes (multiplicative and linear) for an LIF neuron, but it uses a different LIF model from the four-parameter model below. For multiplication, a coincidence detector (Srinivasan and Bernard 1976) is used and the LIF neurons include a model of both spike-train variability and the relative refractory period. Despite the difficulties with the coincidence detection multiplier discussed below, Bugmann's simulations indirectly demonstrate the same effect of t0/τ on the cell's transfer function that is shown below. This suggests that simple integrate-and-fire properties of a cell, such as its refractory period duration and the membrane time constant, continue to affect the cell's transfer
function, even in the presence of spike-train variability and of a relative refractory period model. It is worth noting that logarithmic computation in the models below is not presumed to occur in sensory transduction neurons because of the complicated machinery associated with such neurons. Rather, the models suggest that quasi-logarithmic computations could occur in interneurons of the central nervous system. For this reason, parameter values are based on physiological measurements in cat and rat pyramidal cells (see Section 3.1).

2 Properties of the LIF Transfer Function

Agin (1964) appears to have been the first to relate Hodgkin-Huxley simulations of a uniformly polarized membrane to "the logarithmic law of sensory physiology" by fitting the curve in Figure 1a to the equation f = 27 log(I + 1), where f is firing frequency and I is applied current. Although the Hodgkin-Huxley model is sufficient to produce a loglike transfer function, as is evident in Figure 1, its complex dynamics are not necessary for explaining the nature of the compressiveness. The much simpler LIF model is all that is required to account for the transfer function shown in Figure 1. As shown below, a neuron's membrane resistance and capacitance and the duration of its absolute refractory period are sufficient to account for the loglike transfer function shown in Figure 1, without a need to model the detailed dynamics of action potential generation.

2.1 The LIF Model. In the LIF model, a steady current source, I, uniformly depolarizes the membrane, causing an exponential increase in its potential, V. When the membrane potential reaches a threshold, VTh, a spike occurs (at this point we shall assume the spike to take an infinitesimal amount of time). At the spike onset, the membrane capacitance discharges instantaneously, and the membrane potential is reset to the resting potential. This firing process repeats for as long as the input current is on (see Fig. 2b). A resistor and a capacitor in parallel model the membrane's charging process (see Fig. 2a). The RC integrator of the LIF model relates membrane potential to input current via the usual exponential charging relationship (Horowitz and Hill 1989):

$$V = IR\left(1 - e^{-t/\tau}\right), \qquad (2.1)$$

where τ = RC is the membrane time constant. We now outline Stein's analysis of the leaky integrate-and-fire model. The time tisi (the time it takes the membrane to reach VTh), also called the interspike interval, can be calculated by setting V = VTh, t = tisi in equation 2.1
Figure 2: (a) RC circuit used to model charging of a neuron’s membrane from its resting potential to VTh . C corresponds to the membrane capacitance, R corresponds to the membrane resistance, and I corresponds to an excitatory input such as a current injection or a steady current produced by a train of excitatory postsynaptic potentials. t0 is the duration of the absolute refractory period. (b) Membrane voltage versus time in an RC circuit LIF model, in response to a step current. Only the suprathreshold spikes are produced in the output.
and solving for tisi:

$$t_{isi} = -\tau \ln\left(1 - \frac{V_{Th}}{IR}\right). \qquad (2.2)$$
The rheobasic current, Irh, is defined as the smallest value of current that can drive the membrane potential to VTh:

$$I_{rh} = \frac{V_{Th}}{R}. \qquad (2.3)$$
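To make equations 2.1 through 2.3 concrete, here is a minimal numerical sketch (not code from the article); the threshold and capacitance follow Table 1, while the value of τ is an arbitrary illustrative choice.

```python
import math

# Illustrative parameters in the spirit of Table 1 (tau is an assumed value).
V_th = 0.015        # membrane threshold voltage (V)
C = 6e-11           # soma capacitance (F)
tau = 0.007         # membrane time constant tau = R*C (s)
R = tau / C         # membrane resistance (ohms), as in Table 1

def v_membrane(I, t):
    """Equation 2.1: membrane potential at time t under a step current I."""
    return I * R * (1.0 - math.exp(-t / tau))

I_rh = V_th / R     # equation 2.3: rheobasic current (A)

def t_isi(I):
    """Equation 2.2: interspike interval for a suprathreshold current I."""
    if I <= I_rh:
        return float("inf")  # membrane never reaches threshold
    return -tau * math.log(1.0 - V_th / (I * R))

print(I_rh)               # about 1.3e-10 A
print(t_isi(3.0 * I_rh))  # about 2.8 ms at three rheobases
```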
Substituting equation 2.3 for I in equation 2.2, tisi can be rewritten as

$$t_{isi} = -\tau \ln\left(1 - \frac{I_{rh}}{I}\right). \qquad (2.4)$$

The cell's firing frequency f is the reciprocal of the period of each action potential. Since an action potential is considered to have infinitesimally small duration, the period of each action potential is just the interspike interval (from equation 2.4). The firing frequency as a function of current is therefore

$$f = \frac{1}{-\tau \ln(1 - I_{rh}/I)}. \qquad (2.5)$$
The LIF model described so far does not produce a compressive or quasilogarithmic current frequency relation. As current is increased in equation 2.5, the current frequency relation becomes linear. Stein has shown that equation 2.5 becomes linear for large I by expansion of its power series:

$$f = \frac{1}{\tau\left[(I_{rh}/I) + \frac{1}{2}(I_{rh}/I)^2 + \frac{1}{3}(I_{rh}/I)^3 + \cdots\right]} = \frac{1}{\tau}\left[(I/I_{rh}) - \frac{1}{2} - \frac{1}{12}(I/I_{rh})^{-1} - \cdots\right]. \qquad (2.6)$$
From equation 2.6, we see that as I grows larger than Irh, the frequency approaches the line with slope 1/τ (with I measured in units of the rheobasic current) intersecting the abscissa at 1/2. The frequency, or firing rate, is thus quasi-linear in the approximation considered here, which ignores the effect of refractory period(s) (see Fig. 3). The quasi-linear output of the simple LIF model therefore does not provide a compressive transduction function. However, the analysis of the effects of refractory behavior (both relative and absolute refractory period), which was included in the LIF analysis of Stein (1967), changes the output function of the LIF model from linear to quasi-logarithmic, in suitable parameter ranges, as follows: Let the absolute refractory period take t0 time¹; the interspike interval in equation 2.2 is then tisi + t0, and the frequency is

$$f = \frac{1}{t_0 + t_{isi}} = \frac{1}{t_0 - \tau \ln(1 - I_{rh}/I)}. \qquad (2.7)$$
1 Note that the relative refractory period—the time period after a spike during which it is difficult but not impossible to generate another action potential—is not taken into account in this model, but Stein (1967) has shown that the effect of including a relative refractory period modeled by rising and falling exponentials is qualitatively the same as that of increasing the value of parameter t0 in the current model; that is, as either the absolute refractory period or the relative refractory period is increased, the current frequency relation becomes more compressive. The parameter t0 can therefore be thought of as lumping the effects of both the absolute and relative refractory periods.
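The two regimes discussed around equations 2.5 through 2.7 can be checked directly. The sketch below is an illustration (with currents expressed in units of the rheobasic current, so Irh = 1, and the probed currents arbitrary choices): with t0 = 0 the frequency tracks the quasi-linear asymptote of equation 2.6, while a nonzero t0 bounds f at 1/t0 and compresses large inputs.

```python
import math

tau = 0.007   # membrane time constant (s), illustrative
t0 = 0.002    # absolute refractory period (s), from Table 1

def f_lif(I, t0, tau):
    """Equation 2.7 with I in units of the rheobasic current (I_rh = 1)."""
    if I <= 1.0:
        return 0.0  # subthreshold current: the neuron never fires
    return 1.0 / (t0 - tau * math.log(1.0 - 1.0 / I))

# Quasi-linear regime (t0 = 0): compare with the asymptote (I - 0.5)/tau.
for I in (4.0, 8.0, 12.0):
    print(I, f_lif(I, 0.0, tau), (I - 0.5) / tau)

# Compressive regime (t0 > 0): f is bounded by 1/t0 = 500 Hz, so doubling
# the current adds less and less to the rate, a log-like response.
print([round(f_lif(I, t0, tau)) for I in (2.0, 4.0, 8.0, 16.0)])

# Shape invariance: t0 * f depends on I only through the ratio t0/tau, so
# curves with equal t0/tau coincide after rescaling the rate axis.
print(t0 * f_lif(5.0, t0, tau), 2 * t0 * f_lif(5.0, 2 * t0, 2 * tau))
```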
Figure 3: Plot of the integrate-and-fire transfer function (solid line) with no absolute refractory period (equation 2.5 or equation 2.7 with t0 = 0). This transfer function approaches the line (I − 0.5)/τ (I is shown in units of the rheobasic current). Thus if the time constant of a neuron is much larger than the refractory period duration, the transfer function of the cell is quasi-linear. τ is 0.007; all other parameter values are shown in Table 1.
Table 1: Parameter Values Used in LIF Multiplication and in Figures 3 and 4.

Parameter   Value            Units
t0          0.002            s
τ           varies           s
Vth         0.015            V
C           6 · 10^-11       F
R           τ/C, varies      Ω
Irh         Vth/R, varies    A

t0 = absolute refractory period; τ = time constant (t0/τ is varied in the simulations by keeping t0 constant and varying τ); Vth = membrane threshold voltage; C = soma capacitance; R = soma membrane resistance; Irh = rheobasic current.
Note that as I grows arbitrarily, f in equation 2.5 grows without bound, while in equation 2.7, f remains bounded at 1/t0 . This boundedness compresses high I values and yields a log-like response. The larger t0 is in equation 2.7, the more f is compressed. The current frequency relation expressed by equation 2.7 maintains the same shape when the ratio t0 /τ remains constant. Figure 4a shows plots of equation 2.7 for different ratios t0 /τ . Of the curves in Figure 4b, we see that the one whose corresponding t0 /τ is between 0.1 and 0.2 best approximates a log, since it produces the most linear curve on a semilog plot. One critical issue is the goodness of fit of the LIF transfer function to the logarithm. We have taken the approach of defining goodness of fit in terms of the operation of multiplication. In other words, if the compressive nonlinear transduction function supports an accurate multiplication model, then we deem it to be a good quasi-logarithmic function. In the following section, an LIF neuron with biologically reasonable parameter values is quantitatively shown to multiply its inputs to within a 5% relative error. In summary, we conclude that the log-like response of neurons depends on two properties of the LIF model: 1. The neuron’s time constant (affecting integrative behavior). 2. The existence of a refractory period (affecting rate saturation). We can now see why the Hodgkin-Huxley equations produce a log-like behavior, according to Agin’s original observation. The underlying passive circuit in the Hodgkin-Huxley model is an RC circuit with an active component—voltage-dependent conductances—that corresponds to the LIF model’s firing mechanism. Although it is difficult to make general statements about the detailed behavior of the Hodgkin-Huxley dynamics (due to its mathematical complexity), it appears that the logarithmic behavior Agin (1964) observed, and replicated in Figure 1, is accounted for by the passive integrative component alone; it is captured by the LIF model. In any event, the LIF model is sufficient to account for Agin’s observation of the logarithmic behavior of the Hodgkin-Huxley model. 3 LIF Multiplication Many arguments have been proposed for the usefulness of multiplication in the nervous system. Functionally the product of two signals produces an analog measure of their correlation or, in boolean terms, their conjunction, implemented by the AND operator. Barlow (1969) argues that selectivity and generalization in the visual system require multiplication and summation operations. The Reichardt motion detector (Reichardt et al. 1989) relies on a multiplicative operation. (For a review of neural uses and neural models of multiplication see Koch and Poggio 1992.) A simple scheme for neural multiplication mentioned by Koch and Poggio (1992) involves neurons with
Figure 4: (a) Plots of equation 2.7 for different values of t0 /τ (key shows t0 /τ ). (b) Semilog plot of the curves in panel (a). Parameter values are shown in Table 1.
a logarithmic transfer function that excite or inhibit each other additively or subtractively. Under these assumptions, if x1 is the input to neuron A and x2 is the input to neuron B and the outputs of both neurons A and B are added, then the sum is

$$\log(x_1) + \log(x_2) = \log(x_1 x_2), \qquad (3.1)$$
which is a multiplicative relation. Thus, multiplication and division can be implemented by adding and subtracting outputs of neurons with a logarithmic transfer function. In order to evaluate to what extent, and over what parameter ranges, the transfer function of a neuron is logarithmic, the multiplication model of equation 3.1 was simulated using a single LIF neuron receiving two simultaneous current injections arriving from two LIF cells. A log product provides the functionality of multiplication or correlation simply because it is a monotonic function of the product. The analysis below uses the multiplication model to demonstrate that a quasi-logarithmic computation in neurons can occur accurately across a biologically reasonable t0 /τ range. 3.1 Parameter Choices. Since the interspike interval does not vary in this model and inputs are held constant through time, the equilibrium solution (see equation 2.7) is used. To solve equation 2.7, four parameters need to be specified: t0 , τ , Vth , and C. Parameter choices are based on physiological values observed in cortical pyramidal cells of the rat and the cat, summarized and derived in Bugmann (1991). The input injection current, Iinj , varies over a realistic range of current injection: [1Irh , 13Irh ] amperes, with threshold potential, Vth = 15 mV (see Wolfson et al. 1989). Vth lies between the axon hillock threshold and the soma fast sodium conductance threshold (Schwindt and Crill 1982). The capacitance, C = 6 · 10−11 , agrees with published values of the neural membrane capacitance (1-2µF cm−2 ) (Jack 1979; Kawato et al. 1984) and surface area of the soma of pyramidal neurons (200–700µ m2 ) (Hillman 1979). The refractory period, t0 = 2 ms, is comparable to that used in other biologically motivated simulations of LIF neurons (see Harmon 1961; Amit et al. 1990). Table 1 summarizes the parameter values and derived constants in the model. 3.2 Precision and Dynamic Range. The firing rate of neurons typically extends from a few hertz to a few hundred hertz. The utility of any numerical function dependent on firing rate is thus limited to this range. We have examined the precision of a multiplier that is constructed from two summed quasi-logarithmic LIF neurons. We used the LIF transfer function (cf. equation 2.7) for a range of different t0 /τ (as plotted in Figure 4) and defined the error by “multiplying” several thousand combinations of input currents. This was done by summing the corresponding “quasi-log” firing rates using the LIF curve and comparing this method of multiplication with
the correct numerical answer. We defined error, δ, by the average relative error, that is, as

$$\delta = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| L(a_i b_i) - a_i b_i \right|}{a_i b_i}, \qquad (3.2)$$
where ai and bi are two "input" currents to be multiplied. The estimate of their product, ai bi, is obtained by summation of the "quasi-logarithmic" LIF transfer function as follows: Let fLIF(I) be the frequency-current relation expressed by equation 2.7. We first generated a number of random current value pairs {ai, bi}. Multiplication of each pair was performed by the equation

$$f_{LIF}^{-1}\left(f_{LIF}(a) + f_{LIF}(b)\right), \qquad (3.3)$$
where fLIF is equation 2.7 and fLIF^−1 is its inverse. Using the inverse function permits the unit of the multiplier (current squared) to be the same unit as the actual product of currents, ai bi, and thus the degree of correlation between the two products can be determined. The products obtained by equation 3.3 are then fitted using linear interpolation to the numerically correct products, ai bi, obtaining an equation for a straight line, L(I).2 A new set of LIF products is then generated, and the LIF products are interpolated via L(ai bi) and then subtracted from ai bi, as in equation 3.2 (a code sketch of this procedure is given at the end of this section). Equation 3.2 provides a measure of the average deviation from a perfect multiplication, as a fraction of the product itself. Figure 5a shows the average relative error of the LIF products as a function of t0/τ. This error, expressed as a percentage, is around 5% for 0.13 < t0/τ < 0.23, indicating that in this input range, the "quasi-logarithmic" LIF transfer function is useful as an approximate four-bit multiplier. Figure 5b shows the data points used for the error estimate for the case of t0/τ = 0.2 in Figure 5a. The timing constraints of the LIF model above can be considered by measuring the time to equilibrium of the LIF differential equation as it is iterated in time. As shown in Figure 6, the voltage of a LIF neuron reaches equilibrium (that is, the membrane capacitance saturates) after about 25 ms, across a range of biologically reasonable injection current levels. This means that the multiplication model above can work only when the input to a cell
2 Note that this linear regression is used simply to rescale the output in order to compare it to mathematically correct multiplication. In neuronal terms, this implies a gain and offset at the summing neuron that would need to be specified if actual numerical multiplication were desired in a particular model. However, without this, the simple model of summing of the output of two “logarithmic” neurons provides a result proportional (with an offset) to the desired numerical (log) product over the entire dynamic range of the neuron.
Figure 5: (a) A plot of equation 3.2, the average relative error of the LIF products as a function of t0/τ. N, the number of samples used for each point, is 10,000 (see text for details of this simulation). (b) A sample of 100 data points used to obtain the sixth point from the left in panel (a). t0/τ for this point is 0.2, and the relative error in multiplication is approximately 5%. Both ordinate and abscissa in (b) have units of I²rh.
Figure 6: An integrate-and-fire neuron’s response over time to varying constant input values. Equilibrium is reached in approximately 25 ms, meaning that for logarithmic computation to occur to within a 5% accuracy, the input’s frequency cannot exceed 40 Hz, a reasonable requirement for cortical neurons.
varies temporally with an average period on the order of 25 ms. This time resolution corresponds to a maximum input frequency of 40 Hz that can be resolved by the LIF model presented above. A known model of multiplication that takes into account spike timing is the coincidence detector (Srinivasan and Bernard 1976) (this model is also the multiplicative mechanism used in Bugmann's 1991 LIF model). One difficulty with coincidence detection is that for low spike rates modeled by a Poisson distribution, its accuracy grows as the logarithm of the time window used to count coincidences. Thus, the activity of a single-cell coincidence detector whose inputs spike at a mean rate of 20 Hz would have to be integrated for approximately 20 seconds in order to achieve a 5% accuracy in its multiplication estimate. It has been argued, however, that the coincidence detector could be functioning on the faster temporal scale of dendritic summation (Softky 1995). (See also the reply by Shadlen and Newsome (1995) in the same issue, which analyzes the temporal summation properties of LIF coincidence detectors.)
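The error-measurement procedure of equations 3.2 and 3.3 can be sketched as follows. This is an independent reconstruction, not the authors' code: it sums two quasi-log rates from equation 2.7, applies the closed-form inverse of the transfer function in place of fLIF^−1, rescales with a least-squares line playing the role of L (footnote 2), and evaluates the average relative error on a fresh sample. One caveat of the sketch: the summed rate must stay below the saturation rate 1/t0 for the inverse to exist, so the input range here is restricted to [1.5, 4] rheobases rather than the paper's [1, 13].

```python
import math
import random

tau, t0 = 0.01, 0.002   # t0/tau = 0.2, inside the best quasi-log range

def f_lif(I):
    """Equation 2.7, with currents in units of the rheobase (I > 1)."""
    return 1.0 / (t0 - tau * math.log(1.0 - 1.0 / I))

def f_lif_inv(y):
    """Closed-form inverse of equation 2.7, valid for 0 < y < 1/t0."""
    return 1.0 / (1.0 - math.exp((t0 - 1.0 / y) / tau))

def sample(n):
    # Inputs restricted so that f(a) + f(b) < 1/t0 (see the caveat above).
    return [(random.uniform(1.5, 4.0), random.uniform(1.5, 4.0))
            for _ in range(n)]

def lif_product(a, b):
    """Equation 3.3: the LIF estimate of the product a*b."""
    return f_lif_inv(f_lif(a) + f_lif(b))

random.seed(0)

# Fit the straight line L that maps LIF products onto true products
# (the gain and offset of footnote 2).
fit = sample(2000)
xs = [lif_product(a, b) for a, b in fit]
ys = [a * b for a, b in fit]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
offset = my - slope * mx

# Equation 3.2: average relative error on a new set of pairs.
test = sample(2000)
delta = sum(abs(slope * lif_product(a, b) + offset - a * b) / (a * b)
            for a, b in test) / len(test)
print(delta)  # a few percent, consistent with a quasi-log transfer function
```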
4 Conclusion This work examined the nature of the observed compressive neuronal transfer function. A simple LIF account with four physical parameters was shown to account for a logarithmic current frequency relation. The single dimensionless parameter t0 /τ was shown to control the shape of a LIF neuron’s current frequency transfer function, changing it from linear (for small t0 /τ ) to quasi-logarithmic (for larger t0 /τ ) under biologically reasonable parameter ranges. The degree to which a cell’s transfer function approximates a logarithm was measured at different values of t0 /τ by computing the deviation of LIF−1 {LIF(A) + LIF(B)} from the product AB, using equation 3.2. The transfer function was shown to be approximately logarithmic (to within 5%) over a biologically reasonable and robust t0 /τ range. Acknowledgments The work for this article was supported by NIMH 5R01MH45969-04 and ONR N00014-95-1-0409. References Agin, D. 1964. Hodgkin-Huxley equations: Logarithmic relation between membrane current and frequency of repetitive activity. Nature 201, 625–626. Amit, D. J., Evans, M. R., and Abeles, M. 1990. Attractor neural networks with biological probe records. Network 1, 381–405. Barlow, H. B. 1969. Trigger features, adaptation and economy of impulses. In Information Processing in the Nervous System, K. N. Leibovic, ed., pp. 209–230. Springer-Verlag, Berlin. Bugmann, G. 1991. Summation and multiplication: Two distinct operation domains of leaky integrate-and-fire neurons. Network 2, 489–509. Cornsweet, T. N. 1970. Visual Perception. Academic Press, New York. Harmon, L. D. 1961. Studies with artificial neurons, I: Properties and functions of an artificial neuron. Kybernetik 3, 89–101. Hillman, D. E. 1979. Neural shape parameters and substructures as a basis of neuronal form. In The Neurosciences (4th Study Program), F. O. Schmitt and F. G. Worden, eds., pp. 432–437. MIT Press, Cambridge, MA. Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology 117, 500–544. Horowitz, P., and Hill, W. 1989. The Art of Electronics. Cambridge University Press, Cambridge. Jack, J. 1979. An introduction to linear cable theory. In The Neurosciences (4th Study Program), F. O. Schmitt and F. G. Worden, eds., pp. 423–437. MIT Press, Cambridge, MA. Kawato, M., Hamaguchi, T., Murakami, F., and Tsukahara, N. 1984. Quantitative analysis of electrical properties of dendritic spines. Biol. Cybern. 50, 447–454.
Knight, B. W. 1972. Dynamics of encoding in a population of neurons. Journal of General Physiology 59, 734–766. Koch, C., and Poggio, T. 1992. Multiplying with synapses and neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., pp. 315– 345. Harcourt Brace Jovanovich, San Diego, CA. Land, E. H. 1977. The retinex theory of color vision. Scientific American 237, 108–128. Ratliff, F. 1965. Mach Bands: Quantitative Studies on Neural Networks in the Retina. Holden-Day, New York. Reichardt, W., Egelhaaf, M., and Ai-ke, G. 1989. Processing of figure and background motion in the visual system of the fly. Biol. Cybern. 61, 327–345. Schwindt, P. C., and Crill, W. E. 1982. Factors influencing motorneuron rhythmic firing: Results from a voltage clamp study. J. Neurophysiol. 64, 206–224. Shadlen, M. N., and Newsome, W. T. 1995. Is there a signal in the noise? Current Opinion in Neurobiology 5, 248–250. Softky, W. R. 1995. Simple codes versus efficient codes. Current Opinion in Neurobiology 5, 239–247. Softky, W. R., and Koch, C. 1993. The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience 13, 334–350. Srinivasan, M. V., and Bernard, G. D. 1976. A proposed mechanism for multiplication of neural signals. Biol. Cybernetics 21, 227–236. Stein, R. B. 1967. The frequency of nerve action potentials generated by applied currents. Proceedings of the Royal Society B 167, 64–86. Usher, M., Schuster, H. G., and Niebur, E. 1993. Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Computation 5, 570–586. Wolfson, B., Gutnick, M. J., and Baldino, F. J. 1989. Electrophysiological characteristics of neurons in neocortical explant cultures. Exp. Brain Res. 76, 122–130. Yeshurun, Y., and Schwartz, E. L. 1989. Cepstral filtering on a columnar image architecture: A fast algorithm for binocular stereo segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence 11(7), 759–767.
Received August 1, 1995; accepted May 2, 1996.
Communicated by Wolfram Gerstner
Effect of Delay on the Boundary of the Basin of Attraction in a Self-Excited Single Graded-Response Neuron

K. Pakdaman
INSERM U444 ISARS, Faculté de Médecine Saint-Antoine, 75571 Paris Cedex 12, France
C. P. Malta
Instituto de Física, Universidade de São Paulo, CP 66318, 05389-970 São Paulo, Brazil
C. Grotta-Ragazzo
Instituto de Matemática e Estatística, Universidade de São Paulo, CP 66281, 05389-970 São Paulo, Brazil
J.-F. Vibert
INSERM U444 ISARS, Faculté de Médecine Saint-Antoine, 75571 Paris Cedex 12, France
Little attention has been paid in the past to the effects of interunit transmission delays (representing axonal and synaptic delays) on the boundary of the basin of attraction of stable equilibrium points in neural networks. As a first step toward a better understanding of the influence of delay, we study the dynamics of a single graded-response neuron with a delayed excitatory self-connection. The behavior of this system is representative of that of a family of networks composed of graded-response neurons in which most trajectories converge to stable equilibrium points for any delay value. It is shown that changing the delay modifies the “location” of the boundary of the basin of attraction of the stable equilibrium points without affecting the stability of the equilibria. The dynamics of trajectories on the boundary are also delay dependent and influence the transient regime of trajectories within the adjacent basins. Our results suggest that when dealing with networks with delay, it is important to study not only the effect of the delay on the asymptotic convergence of the system but also on the boundary of the basins of attraction of the equilibria. 1 Introduction The time it takes for a signal to be transmitted from one neuron to another, referred to here as delay, can influence the behavior of neural network models. For example, in spiking neuron models, the firing pattern of single units as well as the coherence between the neurons’ discharge may drastically change as the delay is varied across critical values (an der Heiden 1981; Plant 1981; Vibert et al. 1994; Gerstner 1995; Pakdaman et al. 1996). In artificial Neural Computation 9, 319–336 (1997)
© 1997 Massachusetts Institute of Technology
neural network (ANN) applications, it has been shown that implementing delays in discrete-time networks made of binary units enables them to store time-varying sequences (Sompolinsky and Kanter 1986; Herz et al. 1988). In some ANN applications using analog graded-response neuron (GRN) models, delays arising in hardware implementation may cause the network performance to deteriorate. For example, in a content-addressable memory network, stable equilibrium points are used for storing information (Hopfield 1984; Hirsch 1989), and delays due to hardware implementation may render a locally stable equilibrium point unstable, thus causing loss of the corresponding stored information (Marcus and Westervelt 1989). Such considerations have motivated a number of studies on the asymptotic behavior of GRN networks with delay (Babcock and Westervelt 1987; Marcus and Westervelt 1989; Burton 1991, 1993; Marcus et al. 1991; Roska and Chua 1992; Roska et al. 1992, 1993; Bélair 1993; Civalleri et al. 1993; Gilli 1993; Gopalsamy and He 1994a, 1994b; Ye et al. 1994, 1995; Finnochiaro and Perfetti 1995; Gilli 1995; Gopalsamy and Leung 1996). Studies of the local stability of the equilibria have developed criteria to avoid delay-induced instabilities in some networks (Marcus and Westervelt 1989; Bélair 1993). Constraints on network parameters have also been provided for all, or almost all, trajectories to converge to stable equilibrium points in networks with delay (Burton 1991, 1993; Roska and Chua 1992; Roska et al. 1992, 1993; Bélair 1993; Civalleri et al. 1993; Gopalsamy and He 1994a, 1994b; Ye et al. 1994, 1995). One of the purposes of such studies is to ensure that the asymptotic behavior of the network with delay is similar to that of the corresponding network without delay. However, even in such (almost) convergent networks, important features in the system's dynamics may still depend on the delay value. For instance, changing the delay can alter the boundary of the basins of attraction of the stable equilibrium points. This can be of prime importance in a content-addressable memory network: the position of the basin boundaries determines in which basin given information falls. Thus, changing the shape of the basin boundaries alters the classification of initial conditions performed by the network. We study how the delay may alter the basin boundary in a network constituted by a single GRN with a delayed excitatory self-connection. This system has been chosen because:
• It is simple enough so that thorough theoretical and numerical analysis of its dynamics is possible.
• The asymptotic behavior of the system is similar to that of the corresponding system without delay in the sense that the local asymptotic stability of the equilibria is not affected by the delay and most trajectories converge to stable equilibrium points for any value of the delay, allowing a focus on the effect of delay on the basin boundary.
• The results are representative of the influence of delay on basin boundaries in larger networks.
In Section 2, we present the neuron model. The stationary regime is described in Section 3. The boundary between the basins of attraction of two locally asymptotically stable equilibrium points is estimated for different delay values in Section 4. The behavior of solutions on the boundary and their influence on the transient regime of the system are presented in Section 5. Generalizations of the results are discussed in Section 6. Mathematical aspects are left to the appendices.

2 The Graded-Response Neuron Model

The GRN is described by its activation at time t, denoted a(t), and a sigmoidal output function σ(a). A decay rate γ is also implemented in the model (Cowan and Ermentrout 1978; Hopfield 1984; Pasemann 1993). We consider a neuron that has a delayed self-connection with strictly positive weight W and delay A. The neuron receives a constant input K. The neuron activation evolves according to the following delay differential equation (DDE):

$$\frac{da}{dt}(t) = -\gamma\, a(t) + K + W\sigma(a(t - A)), \qquad (2.1)$$

where the neuron transfer function σ is defined by

$$\sigma(a) = \frac{1}{1 + e^{-a}}. \qquad (2.2)$$
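For readers who want to reproduce the trajectories, DDE 2.1 can be integrated with a basic fixed-step Euler scheme in which the delayed term is read from a history buffer. This is an illustrative sketch, not the authors' code; the step size and time horizon are arbitrary choices, and the parameters are those used in the figures (γ = 1, W = 6, K = −3).

```python
import math

def sigma(a):
    return 1.0 / (1.0 + math.exp(-a))

def integrate_dde(phi, A, T, gamma=1.0, W=6.0, K=-3.0, dt=0.001):
    """Euler integration of da/dt = -gamma*a + K + W*sigma(a(t - A)).

    phi: initial function on [-A, 0]; returns the trajectory on [0, T].
    """
    n_delay = int(round(A / dt))
    # History buffer holding a(t) sampled every dt over the last A seconds.
    buf = [phi(-A + k * dt) for k in range(n_delay + 1)]
    traj = [buf[-1]]
    for _ in range(int(T / dt)):
        a_now = buf[-1]
        a_delayed = buf[-(n_delay + 1)]     # value at t - A
        a_next = a_now + dt * (-gamma * a_now + K + W * sigma(a_delayed))
        buf.append(a_next)
        buf.pop(0)                          # slide the window forward
        traj.append(a_next)
    return traj

# Figure 2a's experiment: the same function of rescaled time t/A, solved for
# two delays, may settle near different equilibria (x1 and x3 are
# approximately -2.58 and 2.58 for these parameters; x2 = 0 is unstable).
short = integrate_dde(lambda t: math.sin(4 * math.pi * t) - 0.02, A=0.5, T=40.0)
long_ = integrate_dde(lambda t: math.sin(math.pi * t) - 0.02, A=2.0, T=40.0)
print(short[-1], long_[-1])
```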
Here, the initial condition of the system is constituted by the history of the neuron activation during a time interval corresponding to the delay. Thus, initial conditions for DDE 2.1 are the continuous functions φ defined on the interval [−A, 0] of length equal to the delay. Since changing the delay alters this interval, we need to specify how to compare initial conditions for a given delay with those for another value. We present two methods. Using the first method, an initial condition defined for a given delay is restricted to a shorter interval and thus defines an initial condition for a shorter delay. The second method consists of rescaling the time unit to the delay, so that for all delays, initial conditions are defined on the same interval of unit length [−1, 0]. Thus DDE 2.1 is transformed into:

$$\frac{1}{\tau}\frac{d\hat{a}}{dt'}(t') = -\hat{a}(t') + K' + W'\,\sigma(\hat{a}(t' - 1)), \qquad (2.3)$$
where t' = t/A, â(t') = a(t), and the parameters have been renamed to τ = γA, K' = K/γ, and W' = W/γ.

3 The Stationary Regime

The asymptotic behavior of DDE 2.1 is analyzed in Appendix A. The results can be summarized as follows: We define the gain as g = W/4 and the
dissipation as γ, the activation decay rate. For g < γ, there is one globally asymptotically stable point, noted x0. For g > γ, there are two input values K− and K+ defined as:

$$K_-(\gamma, W) = \gamma\,\mathrm{Log}\!\left(\frac{W - 2\gamma + \sqrt{W(W - 4\gamma)}}{2\gamma}\right) - \frac{W + \sqrt{W(W - 4\gamma)}}{2},$$

$$K_+(\gamma, W) = -\gamma\,\mathrm{Log}\!\left(\frac{W - 2\gamma + \sqrt{W(W - 4\gamma)}}{2\gamma}\right) - \frac{W - \sqrt{W(W - 4\gamma)}}{2}, \qquad (3.1)$$
on two parameters (φα,ω (t)) for which the basin boundary β(α, ω) = b1 (φα,ω ) is a two-dimensional surface in the (α, ω, c) parameter space. The following two examples illustrate the method. In Figure 1, numerical and theoretical estimates of the boundary between the two basins for special families of initial conditions are presented. Figure 1a shows estimates of the basin boundary for affine functions φα (t) = αt, for −1 ≤ t ≤ 0. For a given delay A, we solve the rescaled equation (2.3) with the initial condition φα +c. We note βr (α) the value of β obtained for this family of initial conditions. It has been estimated for two different delays: A = 0.5 (dashed lines), A = 15 (solid lines). The thick lines were obtained by solving numerically the DDE, and the thin lines result from the theoretical approximation (see equation A.3). This approximation of βr (α) is a linear function. It fits the numerical result over some interval of α around 0. However, the length of this interval shrinks as the delay is decreased. For α = 0, the initial condition is constant, and all four graphs of β go through βr (0) = x2 as expected. DDE 2.1 preserves the order of initial conditions, so that βr (α) is an increasing function of α, and its graph has a positive slope. For delays close to zero, the slope tends to zero, and the graph of βr (α) is close to the straight horizontal line going through x2 = 0. This slope increases with the delay. The basin boundary can also be estimated for a restricted (if A < 1) or extended (if A > 1) initial condition defined by φα (t) = αt for −A ≤ t ≤ 0. We denote by βe (α) the value obtained in this way. Then we have: βe (Aα) = βr (α).
(4.1)
This relation allows us to obtain an estimate of the basin boundary for the first method of comparison—restriction—from the results presented in the previous paragraph. Figures 1b and 1c show estimates of the basin boundary for sine initial conditions (φα,ω (t) = α sin(ωt), −1 ≤ t ≤ 0). For each value of (α, ω), βr (α, ω) is shown for delays A = 0.5 (Fig. 1b1) and A = 5 (Fig. 1b2). Both theoretical approximations and the results of the numerical resolution of the DDE are represented for each delay. In both figures, the theoretical estimation matches the numerical one for low-amplitude sine functions (α close to zero). For a fixed ω, βr (α, ω) is a monotonous function of α that either increases or decreases from zero as α increases from zero. For a fixed α, the boundary displays oscillations that are damped as ω increases (Fig. 1c). The main difference due to the change in the delay is in the amplitude of the “waves” of the boundary surface. For the sake of brevity, only basin boundaries for rescaled initial conditions were presented. For restricted or extended initial conditions, the results are similar. In fact there is a relation similar to equation 4.1 linking the basin’s boundary of rescaled sine initial conditions to that of extended or restricted sine initial conditions.
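Two pieces of this and the previous section lend themselves to a direct numerical check: the bistable input window of equation 3.1, and the bisection on the bias c that locates b1(φ) for a family member, since trajectories of φ + c converge to x1 for c < b1(φ) and to x3 for c > b1(φ). The sketch below is an illustration, not the authors' code: it reuses the integrate_dde helper defined after equation 2.2, and the bracket, horizon, and iteration counts are arbitrary choices.

```python
import math

def K_window(gamma, W):
    """Equation 3.1: the input window (K-, K+) of bistability (W/4 > gamma)."""
    D = math.sqrt(W * (W - 4.0 * gamma))
    log_term = gamma * math.log((W - 2.0 * gamma + D) / (2.0 * gamma))
    return log_term - (W + D) / 2.0, -log_term - (W - D) / 2.0

K_minus, K_plus = K_window(1.0, 6.0)
print(K_minus, K_plus)   # about -3.415 and -2.585: K = -3 is in the window

def boundary_bias(phi, A, c_lo=-5.0, c_hi=5.0, T=60.0, iters=40):
    """Bisect on the bias c to estimate b1(phi) for gamma=1, W=6, K=-3.

    Off the boundary, the trajectory of phi + c ends near x1 (about -2.58)
    or x3 (about 2.58); the unstable point x2 = 0 separates the outcomes.
    """
    def reaches_x3(c):
        return integrate_dde(lambda t: phi(t) + c, A, T)[-1] > 0.0
    for _ in range(iters):
        c_mid = 0.5 * (c_lo + c_hi)
        if reaches_x3(c_mid):
            c_hi = c_mid
        else:
            c_lo = c_mid   # still in the basin of x1: raise the bias
    return 0.5 * (c_lo + c_hi)

# beta_e(alpha) for the extended affine family phi_alpha(t) = alpha * t on
# [-A, 0]; equation 4.1 relates this to the rescaled boundary beta_r.
print(boundary_bias(lambda t: 2.0 * t, A=0.5))
```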
5 Oscillations on the Boundary The boundary separating the two basins of attraction is formed by the solutions that oscillate indefinitely around x2 ; that is, φ belongs to the boundary if and only if for all t ≥ 0, there is θ with −A ≤ θ ≤ 0 such that a(t + θ ) = x2 .
For small delays, these oscillatory solutions are damped to x2; for large delays, they are either damped to x2 or asymptotically periodic. Such oscillations are due to the delay. Although they are unstable, they have an important influence on the behavior of the system. Solutions close to the boundary display transient oscillations before converging to one of the two stable equilibria x1 or x3 (Fig. 2a). Such transient oscillations are observed for the solution of any initial condition φ, which crosses x2 (i.e., the function φ − x2 changes sign on [−A, 0]). As the delay is increased, the duration of the transient oscillations increases exponentially with the delay (compare Fig. 2a with Fig. 2b). The transient oscillations and the lengthening of their duration with the delay are due to the fact that for large delays, solutions of DDE 2.1 behave transiently as the corresponding solutions of the following difference equation (Sharkovsky et al. 1993):

$$a(t + A) = \frac{W}{\gamma}\,\sigma(a(t)) + \frac{K}{\gamma}. \qquad (5.1)$$
In contrast with DDE 2.1, the difference equation 5.1 has (infinitely many) nonconstant periodic solutions with nonnegligible basins of attraction for W > 4γ and K− < K < K+. Thus for large enough delays, solutions of DDE 2.1, corresponding to initial conditions lying in the basin of attraction of a periodic solution of the difference equation 5.1, are transiently attracted to this nonconstant periodic solution before escaping toward one of the two stable equilibria.
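The mechanism can be previewed by iterating equation 5.1 directly: for the continuous-time difference equation, every point of an interval of length A evolves independently under the map a ↦ (W/γ)σ(a) + K/γ, so an initial profile that crosses x2 settles onto a nonconstant, A-periodic square-wave-like profile alternating between x1 and x3. A minimal sketch (not the authors' code) with the figures' parameters:

```python
import math

def g(a, gamma=1.0, W=6.0, K=-3.0):
    """The map of equation 5.1: a(t + A) = (W/gamma)*sigma(a(t)) + K/gamma."""
    return (W / gamma) * (1.0 / (1.0 + math.exp(-a))) + K / gamma

# Iterate one interval [0, A) of an initial profile pointwise: each sample
# evolves independently under g, so the profile tends to a square wave that
# alternates between the stable equilibria x1 and x3 (about -2.58 and 2.58).
profile = [math.sin(2.0 * math.pi * k / 100.0) - 0.02 for k in range(100)]
for _ in range(20):
    profile = [g(a) for a in profile]
print(min(profile), max(profile))  # close to x1 and x3: a nonconstant
                                   # periodic solution of equation 5.1
```

6 Generalization

Networks in which there is at least one directed path connecting any given neuron to any other one are referred to as irreducible. Understanding the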
Figure 1: Facing page. The boundary separating the basins of attraction. (a) The boundary βr(α) of the basin of attraction for affine initial conditions with slope α and bias c, φα(t) = αt + c (−1 ≤ t ≤ 0), is shown for A = 0.5 (dashed line) and A = 15 (solid line). The thick lines correspond to the numerical result, and the thin lines correspond to the theoretical approximation (equation A.3). Abscissae: slope α, ordinates: bias c. (b) The boundary βr(α, ω) of the basin of attraction for sine initial conditions with amplitude α, angular velocity ω, and bias c: φα,ω(t) = α sin(ωt) + c (−1 ≤ t ≤ 0) is shown for A = 0.5 (1) and A = 5 (2). (c) Cross sections at α = 5 of β(α, ω). The thick lines correspond to the numerical result, and the thin lines correspond to the theoretical approximation (equation A.3). Solid lines for the long delay (A = 5) and dashed lines for the short delay (A = 0.5). Abscissae: angular velocity ω, ordinates: bias c. Parameters used: γ = 1, W = 6, K = −3.
Figure 2: Examples of trajectories. (a) Temporal evolution of the solution going through the initial condition: φ(t) = sin(4πt) − 0.02 for −0.5 ≤ t ≤ 0 (thin line) for a delay A = 0.5 and ψ(t) = sin(π t) − 0.02 for −2 ≤ t ≤ 0 (thick line) for a delay A = 2. Both initial conditions correspond to the same function sin(2π t) − 0.02 when the time unit is rescaled to the delay. The solutions tend to different equilibrium points depending on the delay value. Abscissa: time t; ordinates: activation a. (b) Temporal evolution of the solution going through the initial condition φ(t) = sin(0.2π t)−0.02 for −10 ≤ t ≤ 0, for a delay A = 10. This solution exhibits long-lasting transient oscillations before stabilizing close to the stable equilibrium point. Abscissa: time t (note the difference in the abscissa scale between A and B); ordinates: activation a. Parameters used: γ = 1, W = 6, and K = −3.
dynamics of such networks is important because they form the building blocks of other networks, such as cascades, and under appropriate conditions, the asymptotic behavior of a network can be derived from that of its constituting irreducible networks in cascade (Hirsch 1989). The results presented in Section 3 for the single self-exciting GRN can be generalized to excitatory irreducible GRN networks with delay. Details of the generalization are presented in Pakdaman et al. (1995c). The dynamics of an n neuron network evolve according to the following equations:
$$\frac{da_i}{dt}(t) = -\gamma\, a_i(t) + K_i + \sum_{j=1}^{n} W_{ij}\,\sigma_j(a_j(t - A_{ij})), \qquad 1 \le i \le n. \qquad (6.1)$$
The neuron transfer functions are taken to be sigmoidal, that is, σi is a smooth, bounded, strictly increasing function, such that there is a unique point pi such that σi''(pi) = 0. The connection matrix W = [Wij] is assumed positive (Wij ≥ 0 for all i, j) and irreducible; it does not leave invariant any proper nontrivial subspace generated by a subset of the standard basis vectors for Rn. We define the gain g as the largest (real) eigenvalue of the n × n matrix [Wij σj'(pj)] and the dissipation γ in the same way as for the single neuron. When g < γ, the network is globally asymptotically stable. When both g > γ and the absolute value of the input received by each unit in the network is large enough, the network is globally asymptotically stable. When g > γ, depending on the input, there can be several equilibrium points. In such cases, the basin boundaries are a finite number of hypersurfaces (codimension one manifolds) that partition the phase space into disjoint regions corresponding to the basins of attraction of the locally stable equilibria. The boundaries form the "upper" or "lower" bounds of a given basin of attraction; in the same way as for the single self-exciting neuron, the boundary is the upper bound of the basin of attraction of x1 and the lower bound of the basin of attraction of x3. The boundaries contain necessarily unstable points, and the trajectories that oscillate around them. Similar results are also expected when the input K is no longer constant and is allowed to vary with time, as argued in Section B.1. The influence of delay on network dynamics has also been investigated in networks composed of a different version of the GRN, usually referred to as the "shunting model" (Destexhe and Gaspard 1993; Destexhe 1994; Houweling 1994; Lourenço and Babloyantz 1994). The results presented in this article can be adapted to the dynamics of a single shunting model neuron with a delayed excitatory self-connection receiving a constant input (see Section B.2).
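As an illustration of the gain condition (not code from the article), the gain g of a small excitatory network can be computed as the largest real eigenvalue of the matrix [Wij σj'(pj)]; for the logistic σ of equation 2.2, the inflection point is pj = 0 with σ'(0) = 1/4. The example connection matrix below is an arbitrary choice.

```python
import numpy as np

def network_gain(W, sigma_prime_at_p):
    """Largest real eigenvalue of [W_ij * sigma_j'(p_j)], i.e., the gain g."""
    J = W * sigma_prime_at_p[np.newaxis, :]  # scale column j by sigma_j'(p_j)
    return max(ev.real for ev in np.linalg.eigvals(J))

# Example: a 3-neuron irreducible excitatory ring with logistic transfer
# functions, for which sigma_j'(p_j) = 1/4 at the inflection point p_j = 0.
W = np.array([[0.0, 6.0, 0.0],
              [0.0, 0.0, 6.0],
              [6.0, 0.0, 0.0]])
g = network_gain(W, np.full(3, 0.25))
print(g)   # 1.5 here; with gamma = 1 the condition g > gamma allows
           # multiple equilibria, as for the single self-exciting neuron
```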
7 Discussion and Conclusion

Dynamical systems with delayed feedback evolve in infinite-dimensional phase spaces and therefore may display complex behavior that has no counterpart in corresponding systems without delay. For example, scalar systems with delayed feedback can display complex dynamics depending on the nature of the feedback, which may be positive, negative, or mixed (an der Heiden and Mackey 1982; Malta and Grotta-Ragazzo 1991), and on the number of feedback loops (Glass and Malta 1990; Grotta-Ragazzo and Malta 1992). Moreover, the basin boundary in such systems can have an intricate structure (Losson et al. 1993). GRN neural networks with delayed interaction are also known to display delay-induced complex chaotic behavior (Marcus et al. 1991; Gilli 1993, 1995). In order to avoid such phenomena, the network parameters, such as the neurons' gain, the connection weights, and the delays, can be restricted to ranges in which the asymptotic behavior of the network with delay is similar to that of the corresponding network without delay for almost all initial conditions.

We studied the dynamics of a single GRN with a delayed self-exciting connection, which satisfies the above condition. For this system, we derived theoretical and numerical descriptions of the boundary separating the basins of attraction of two locally stable equilibrium points. The description of the asymptotic behavior of the system is sketched in Figure 3 and can be summarized as follows: The boundary separating the basins of attraction of the two stable equilibria divides the phase space into two regions, in the same way a plane divides three-dimensional space (gray surface in Fig. 3). Initial conditions on each side of the boundary form the basin of attraction of one of the two stable equilibria. Moreover, solutions on the boundary oscillate around the unstable point (closed trajectory in Fig. 3).

Figure 3: Schematic representation of the basin boundary. The boundary separating the basins of attraction of the two stable equilibria x1 and x3 is schematically represented by a surface going through the unstable point x2 and a periodic solution (the closed trajectory). Solutions below (respectively above) the boundary converge to x1 (respectively x3). However, some are transiently attracted to the oscillatory solutions on the boundary before eventually approaching one of the two stable equilibrium points, as exemplified in the figure.

Two effects of the delay on the system behavior were then presented:

1. The location of the basin boundary changes with the delay. Although the system with delay has exactly the same stable equilibria as the corresponding system without delay, and almost all solutions tend to these stable equilibria, the solution of a given initial condition may tend to either of the two stable equilibria, or even remain on the boundary, depending on the value of the delay (Fig. 2a). In this respect, the delay should be counted among the important parameters shaping the basins of attraction of the equilibria. For a content-addressable memory network, this means that the information retrieved for an initial condition may change if the delay is modified. The asymptotic behavior of the single neuron with a delayed self-exciting connection is representative of excitatory irreducible networks. Moreover, as indicated by Hirsch (1989), other networks, including those with negative connection weights, can also be studied in the same way, for example, after appropriate changes of variables. This possibility is discussed at length in Hirsch (1989), and we refer to that paper for general results as well as specific examples.

2. Delay-induced transient oscillations. Due to the presence of sustained oscillations on the boundary, and attracting undamped oscillations in the limit of infinite delays, solutions of some initial conditions display transient oscillations before converging to one of the two stable equilibria (Fig. 2b). As the delay is increased, more and more solutions display such behavior, and the duration of the transient oscillations increases considerably. In many ANN applications, information processing by the network terminates when the system stabilizes in one of its equilibria (Hopfield 1984; Hirsch 1989). Thus, delay-induced transient oscillations may considerably slow information processing by lengthening the duration of the transient regime.

Larger networks also display delay-induced transient oscillations. These were reported in two-neuron networks from numerical investigations (Babcock and Westervelt 1987) and from both numerical and analytical studies (Pakdaman et al. 1995b). The long-lasting transient oscillations are due to the presence of attracting nonconstant periodic solutions of the discrete-time network associated with the continuous-time network. Discrete-time neural networks and, in general, discrete-time dynamical systems are more prone to display periodic (or oscillatory) solutions than corresponding continuous-time systems; therefore, delay-induced transient oscillations can be expected in many neural networks with delay.

In conclusion, for neural networks designed to converge to equilibria, such as content-addressable memories, the fraction of errors in retrieving the relevant information depends on the shape of the basins of attraction of the equilibrium points. We have shown that even in an almost convergent system, the boundary of the basins may be modified by the presence of delays. This can cause performance to deteriorate if the delay is not under control.

Appendix A: Asymptotic Behavior

The neuron activation evolves according to the following DDE:

da/dt (t) = −γ a(t) + K + W σ(a(t − A)) ,   (A.1)
where σ is the sigmoidal function defined in equation 2.2 and W and A are strictly positive real numbers. This system is similar to the positive feedback loop with a piecewise constant transfer function studied by an der Heiden and Mackey (1982), as well as several models analyzed by Sharkovsky and collaborators (1993).

Let C[−A, 0] be the space of continuous real functions on the interval [−A, 0]. For φ in C[−A, 0], there exists a unique real function a(t, φ) on the interval [−A, +∞) such that a(t, φ) = φ(t) for −A ≤ t ≤ 0 and a(t, φ) satisfies equation A.1 for t ≥ 0 (Hale and Verduyn Lunel 1993). For such a solution of the DDE, we denote by a_t(φ) the element of C[−A, 0] defined by a_t(φ)(θ) = a(t + θ, φ), for −A ≤ θ ≤ 0. Let φ0 and φ1 be two elements of C[−A, 0]; we say that φ0 is larger (respectively strictly larger) than φ1, written φ0 ≥ φ1 (respectively φ0 ≫ φ1), if for all θ in [−A, 0] we have φ0(θ) ≥ φ1(θ) (respectively φ0(θ) > φ1(θ)). DDE A.1 generates an order-preserving semiflow, as for φ0 and φ1 in C[−A, 0] we have: if φ0 ≥ φ1 and φ0 ≠ φ1, then for t > 2A,

a_t(φ0) ≫ a_t(φ1) .   (A.2)
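For illustration only, the following Python sketch (ours, not part of the original analysis) integrates DDE A.1 by forward Euler with the history stored so that the delayed term can be looked up. The logistic form σ(a) = 1/(1 + e^{−a}) and all parameter values are assumptions, chosen so that the system is bistable with the unstable point at 0; constant initial functions on either side of the basin boundary settle on different stable equilibria, and rerunning with a different delay A shifts the boundary's location, as described in point 1 of the Discussion.

```python
import math

def run(phi, A=2.0, gamma=1.0, K=-2.5, W=5.0, dt=0.005, T=100.0):
    """Integrate da/dt = -gamma*a(t) + K + W*sigma(a(t - A)) from history phi."""
    sigma = lambda a: 1.0 / (1.0 + math.exp(-a))  # assumed form of eq. 2.2
    n_delay = int(round(A / dt))
    hist = [phi(-A + i * dt) for i in range(n_delay + 1)]  # a on [-A, 0]
    for _ in range(int(T / dt)):
        a_now, a_del = hist[-1], hist[-1 - n_delay]
        hist.append(a_now + dt * (-gamma * a_now + K + W * sigma(a_del)))
    return hist[-1]

# With gamma = 1, K = -2.5, W = 5, the equilibrium condition a = K + W*sigma(a)
# has three solutions x1 < x2 = 0 < x3; initial functions below (above) the
# boundary approach x1 (x3), and the middle value probes the boundary itself.
for c in (-1.0, 0.05, 1.0):
    print(c, run(lambda t, c=c: c))
```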
The fact that DDE A.1 generates an order-preserving semiflow strongly limits the possible asymptotic behaviors of the solutions (Smith 1987; Roska et al. 1992). The number of equilibrium points of DDE A.1, as well as their stability, is deduced from the analyses presented in Cowan and Ermentrout (1978) and Pasemann (1993) and from the fact that the local stability of the stable equilibria in DDEs generating strongly order-preserving semiflows is not affected by the delay (Smith 1987).

• For g = W/4 < γ, all solutions converge uniformly asymptotically to x0.

• For g = W/4 > γ:

1. For either K < K− or K > K+, all solutions converge uniformly asymptotically to x0.

2. For K− < K < K+, the union of the basins of attraction of x1 and x3 is a dense open subset of C[−A, 0]. Let u be a strictly positive function defined on the interval [−A, 0]. For φ in C[−A, 0], there is a unique real number bu(φ) such that a_t(φ + c·u) tends asymptotically to x1 (respectively x3) for all c < bu(φ) (respectively c > bu(φ)), where for a real number c we denote by φ + c·u the element of C[−A, 0] defined by (φ + c·u)(t) = φ(t) + c·u(t), for −A ≤ t ≤ 0. The solution a(t, φ + bu(φ)·u) oscillates around x2, in the sense that the function a(t, φ + bu(φ)·u) − x2 has at least one zero on each interval kA < t < (k + 1)A, for k ≥ −1. The boundary of the basins of attraction of the two locally stable equilibrium points is formed by such oscillating solutions. For small delays these oscillations are damped to x2; for large delays they are either damped to x2 or asymptotically periodic (Arino 1993; Mallet-Paret and Sell 1996). bu is a strictly decreasing continuous map from C[−A, 0] to the real line. The boundary of the basins of attraction of the two locally stable equilibria is the closed set {φ such that bu(φ) = 0}. More precisely, the boundary is a codimension-one, unordered, Lipschitz manifold (Takáč 1991).

Let ν = ν_A be the real solution of the characteristic equation associated with DDE A.1 at x2, and let u(t) = e^{νt}, for −A ≤ t ≤ 0, be the most unstable eigendirection of the linearized equation at x2. We approximate bu(φ) by pu(φ), where −pu(φ) is the projection of φ onto the line directed by u along the eigenspace of all the other solutions of the characteristic equation of DDE A.1 (Hale and Verduyn Lunel 1993):

pu(φ) = x2 − φ(0) + W σ′(x2) e^{−νA} ∫_{−A}^{0} (x2 − φ(u)) e^{−νu} du .   (A.3)
This approximation is satisfactory near the unstable equilibrium. When the characteristic equation has a single solution with a positive real part, the basin boundary coincides, locally near x2, with the stable manifold of the unstable equilibrium point x2. In this case, the zero set of pu represents the tangent space to the stable manifold at x2.

Appendix B: Generalization

B.1 Time-Varying Inputs. Neural networks may be used for processing time-varying inputs. In this case, the input K varies with time. The nonautonomous system thus obtained is also order-preserving. For periodic input signals of period T, the time-T map associated with the system is a strongly order-preserving discrete-time map, and there are a number of results on the fixed points of such maps, and on the basins of attraction of the stable fixed points, that generalize those presented in the previous sections (Takáč 1992; Hirsch 1995). The behavior of a single GRN neuron with a delayed self-connection in the presence of slow periodic inputs has been investigated numerically (Pakdaman et al. 1995a). The general study of the behavior of GRN networks with delay in response to periodic inputs will be presented elsewhere.

B.2 The Shunting Model. In the shunting model, the neuron activation evolves according to the following DDE:

da/dt (t) = −γ a(t) + K + W (E − a(t)) σ(a(t − A)) ,   (B.1)
where E > 0 is the positive reversal potential. Let K0 = K/γ; we further suppose E > K0. For any function φ in C[−A, 0], there is a unique solution a(t, φ) of DDE B.1 going through φ, and there is a time T ≥ −A such that the activation is below the reversal potential after T; that is, for all t ≥ T, a(t, φ) < E. The restriction of the set of initial conditions of DDE B.1 to continuous functions on the interval [−A, 0] that are smaller than E generates a strictly order-preserving semiflow. Based on the above considerations and the study of the local stability of the equilibria of DDE B.1, we can state the following. Let K̄0 be the unique solution of the following equation such that K̄0 < E:

(e^{K̄0 + 2} − 1)(E − K̄0) + 4 = 0 .   (B.2)
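Taking equation B.2 at face value, the threshold K̄0 can be located numerically by bisection. The short Python sketch below is ours, with an arbitrary illustrative value of E; it relies only on the facts that the left-hand side is 4 at K̄0 = −2 and tends to −∞ as K̄0 → −∞.

```python
from math import exp

def f(k, E=2.0):
    """Left-hand side of equation B.2, with an illustrative reversal potential E."""
    return (exp(k + 2.0) - 1.0) * (E - k) + 4.0

# f(-2) = 4 > 0 while f -> -infinity as k -> -infinity, so a root lies in
# (-50, -2); repeated bisection pins it down to machine precision.
lo, hi = -50.0, -2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
print("K0_bar ~", 0.5 * (lo + hi))
```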
For K0 < K̄0, there is one globally asymptotically stable point, denoted x0, with K0 < x0 < E. For K0 > K̄0, there are two positive weight values W− and W+ such that when either W < W− or W > W+, there is one globally asymptotically stable point, also denoted x0, with K0 < x0 < E; and for W− < W < W+, the system is bistable: there are two locally asymptotically stable equilibria, denoted x1 and x3, and one unstable equilibrium point, denoted x2, with K0 < x1 < x2 < x3 < E. For the bistable system, most trajectories converge to the stable equilibrium points, in the sense that the union of the basins of attraction of the two stable equilibrium points is a dense open set. The boundary separating the two basins of attraction can be characterized in the same way as for the GRN.

Acknowledgments

We thank E. Av-Ron and Prof. O. Arino for helpful comments. This work was partially supported by USP/COFECUB under project U/C 9/94. One of us (CPM) is also partially supported by CNPq (the Brazilian Research Council).

References

Arino, O. 1993. A note on "the discrete Lyapunov function . . . ". Journal of Differential Equations 104, 169–181.
Babcock, K. L., and Westervelt, R. M. 1987. Dynamics of simple electronic neural networks. Physica D 28, 305–316.
Bélair, J. 1993. Stability in a model of a delayed neural network. Journal of Dynamics and Differential Equations 5, 607–623.
Burton, T. A. 1991. Neural networks with memory. Journal of Applied Mathematics and Stochastic Analysis 4, 313–332.
Burton, T. A. 1993. Averaged neural networks. Neural Networks 6, 677–680.
Civalleri, P. P., Gilli, M., and Pandolfi, L. 1993. On stability of cellular neural networks with delay. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40, 157–165.
Cowan, J., and Ermentrout, G. B. 1978. Some aspects of the "eigenbehavior" of neural nets. In Studies in Mathematical Biology, part I, S. A. Levin, ed., 67–117. Mathematical Association of America, Providence, RI.
Destexhe, A. 1994. Oscillations, complex spatiotemporal behavior, and information transport in networks of excitatory and inhibitory neurons. Physical Review E 50, 1594–1606.
Destexhe, A., and Gaspard, P. 1993. Bursting oscillations from a homoclinic tangency in a time delay system. Physics Letters A 173, 386–391.
Finnochiaro, M., and Perfetti, R. 1995. Relation between template spectrum and stability of cellular neural networks with delay. Electronics Letters 31, 2024–2026.
Gerstner, W. 1995. Time structure of the activity in neural network models. Physical Review E 51, 738–758.
Gilli, M. 1993. Strange attractors in delayed cellular neural networks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40, 849–853.
Gilli, M. 1995. A spectral approach for chaos prediction in delayed cellular neural networks. International Journal of Bifurcation and Chaos 5, 869–875.
Glass, L., and Malta, C. P. 1990. Chaos in multi-looped negative feedback systems. Journal of Theoretical Biology 145, 217–223.
Gopalsamy, K., and He, X.-Z. 1994a. Stability in asymmetric Hopfield nets with transmission delays. Physica D 76, 344–358.
Gopalsamy, K., and He, X.-Z. 1994b. Delay-independent stability in bidirectional associative memory networks. IEEE Transactions on Neural Networks 5, 998–1002.
Gopalsamy, K., and Leung, I. 1996. Delay-induced periodicity in a neural netlet of excitation and inhibition. Physica D 89, 395–426.
Grotta-Ragazzo, C., and Malta, C. P. 1992. Singularity structure of the Hopf bifurcation surface of a differential equation with two delays. Journal of Dynamics and Differential Equations 4, 617–650.
Hale, J. K., and Verduyn Lunel, S. M. 1993. Introduction to Functional Differential Equations. Applied Mathematical Sciences vol. 99. Springer-Verlag, New York.
an der Heiden, U. 1981. Analysis of Neural Networks. Lecture Notes in Biomathematics vol. 35. Springer-Verlag, New York.
an der Heiden, U., and Mackey, M. C. 1982. The dynamics of production and destruction: Analytic insight into complex behavior. Journal of Mathematical Biology 16, 75–101.
Herz, A., Sulzer, B., Kühn, R., and Van Hemmen, J. L. 1988. The Hebb rule: Storing static and dynamic objects in an associative neural network. Europhysics Letters 7, 663–669.
Hirsch, M. W. 1989. Convergent activation dynamics in continuous time networks. Neural Networks 2, 331–350.
Hirsch, M. W. 1995. Fixed points of monotone maps. Journal of Differential Equations 123, 171–179.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences USA 81, 3088–3092.
Houweling, A. R. 1994. The effects of interneural time delay in a simple two neuron shunting model. Personal communication.
Losson, J., Mackey, M. C., and Longtin, A. 1993. Solution multistability in first-order nonlinear differential delay equations. Chaos 3, 167–176.
Lourenço, C., and Babloyantz, A. 1994. Control of chaos in networks with delay: A model for synchronization of cortical tissue. Neural Computation 6, 1141–1154.
Mallet-Paret, J., and Sell, G. R. 1996. The Poincaré–Bendixson theorem for monotone cyclic feedback systems with delay. Journal of Differential Equations 125, 441–489.
Malta, C. P., and Grotta-Ragazzo, C. 1991. Bifurcation structure of scalar delayed equations. International Journal of Bifurcation and Chaos 1, 657–665.
Marcus, C. M., Waugh, F. R., and Westervelt, R. M. 1991. Nonlinear dynamics and stability of analog neural networks. Physica D 51, 234–247.
Marcus, C. M., and Westervelt, R. M. 1989. Stability of analog neural networks with delay. Physical Review A 39, 347–359.
Pakdaman, K., Alvarez, F., Diez-Martínez, O., and Vibert, J.-F. 1995a. Single neuron with recurrent connection: Response to slow periodic modulation. Biosystems. In press.
Pakdaman, K., Grotta-Ragazzo, C., Malta, C. P., and Vibert, J.-F. 1995b. Delay-induced transient oscillations in a two-neuron network. Tech. Rep., Publicações IFUSP/P-1183, Universidade de São Paulo, Brazil.
Pakdaman, K., Malta, C. P., Grotta-Ragazzo, C., and Vibert, J.-F. 1995c. Asymptotic behavior of excitatory irreducible networks of graded-response neurons. Tech. Rep., Publicações IFUSP/P-1200, Universidade de São Paulo, Brazil.
Pakdaman, K., Vibert, J.-F., Boussard, E., and Azmy, N. 1996. Single neuron with recurrent excitation: Effect of the transmission delay. Neural Networks 9, 797–818.
Pasemann, F. 1993. Dynamics of a single neuron model. International Journal of Bifurcation and Chaos 3, 271–278.
Plant, R. E. 1981. A FitzHugh differential-difference equation modeling recurrent neural feedback. SIAM Journal of Applied Mathematics 40, 150–162.
Roska, T., and Chua, L. O. 1992. Cellular neural networks with non-linear and delay-type template elements and non-uniform grids. International Journal of Circuit Theory and Applications 20, 469–481.
Roska, T., Wu, C. F., Balsi, M., and Chua, L. O. 1992. Stability and dynamics of delay-type general and cellular neural networks. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 39, 487–490.
Roska, T., Wu, C. F., and Chua, L. O. 1993. Stability of cellular neural networks with dominant nonlinear and delay-type templates. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 40, 270–272.
Sharkovsky, A. N., Maistrenko, Yu. L., and Romanenko, E. Yu. 1993. Difference Equations and Their Applications. Kluwer, Dordrecht.
Smith, H. 1987. Monotone semiflows generated by functional differential equations. Journal of Differential Equations 66, 420–442.
Sompolinsky, H., and Kanter, I. 1986. Temporal association in asymmetric neural networks. Physical Review Letters 57, 2861–2864.
Takáč, P. 1991. Domains of attraction of generic omega-limit sets for strongly monotone semi-flows. Zeitschrift für Analysis und ihre Anwendungen 10, 275–317.
Takáč, P. 1992. Domains of attraction of generic ω-limit sets for strongly monotone discrete-time semigroups. Journal für die reine und angewandte Mathematik 423, 101–173.
Vibert, J.-F., Pakdaman, K., and Azmy, N. 1994. Inter-neural delay modification synchronizes biologically plausible neurons. Neural Networks 7, 589–607.
Ye, H., Michel, A. N., and Wang, K. 1994. Global stability and local stability of Hopfield neural networks with delays. Physical Review E 50, 4206–4213.
Ye, H., Michel, A. N., and Wang, K. 1995. Qualitative analysis of Cohen-Grossberg neural networks with multiple delays. Physical Review E 51, 2611–2618.
Received June 16, 1995; accepted May 6, 1996.
Communicated by Terrence Fine
Shattering All Sets of k Points in “General Position” Requires (k − 1)/2 Parameters Eduardo D. Sontag Department of Mathematics, Rutgers University, New Brunswick, NJ 08903 USA
For classes of concepts defined by certain classes of analytic functions depending on n parameters, there are nonempty open sets of samples of length 2n + 2 that cannot be shattered. A slightly weaker result is also proved for piecewise-analytic functions. The special case of neural networks is discussed.
1 Introduction

The generalization capabilities of neural networks are often quantified in terms of the maximal number of possible binary classifications that could be obtained, by means of weight assignments, on any given set of input patterns. Obviously, the larger the number of such potential classifications, the lower the predictability that is possible on the basis of a partial assignment ("loading of training data") already achieved. Thus, it is of interest to obtain useful upper bounds on this number and in particular to study the Vapnik-Chervonenkis (VC) dimension, which is the size of the largest set of inputs that can be shattered (arbitrary binary labeling is possible). Recent results (cf. Koiran and Sontag in press and also Maass 1994 for closely related work and a survey) show that the VC dimension grows at least as fast as the square n^2 of the number of adjustable weights n in the net, and this number might grow as fast as n^4 (Karpinski and Macintyre in press). These results are quite pessimistic, since they imply that the number of samples required for reliable generalization, in the sense of PAC learning, is very high. On the other hand, it is conceivable that those sets of input patterns that can be shattered are all in some sense "special" and that if we ask instead, as done in the classical literature in pattern recognition, for the shattering of all sets in "general position" (e.g., Cover's work on the capacity of perceptrons, Cover 1988), then an upper bound of O(n) might hold. Strong evidence for this possibility was provided by Adam Kowalczyk (in press), who showed that this indeed happens for the (very special) case of hard-threshold neural networks with a fully connected first layer. In this article, I establish a linear upper bound for arbitrary sigmoidal (as well as threshold) neural nets, proving that in that sense, the perceptron results can be recovered in a strong sense (up to a factor of two).

Neural Computation 9, 337–348 (1997)
© 1997 Massachusetts Institute of Technology
There is potential relevance of the results to variations of PAC learning when one weakens the requirement that generalization capabilities must hold with respect to all possible input distributions; this will be the subject of future work. The estimate is also useful in the very different context of understanding computational abilities; as an illustration, note that the main result is being employed by Maass (in press) for contrasting the computational power of spiking neurons (Maass 1996) with that of sigmoidal neural networks.

1.1 Parametric Classes, Shattering. Assume given two positive integers n and m and a function

β : R^n × R^m → R .

In this context we will call n the number of parameters, elements of R^n parameter vectors (or weights), m the input dimension, elements of R^m input vectors, and β a response function. We think of β(x, u) as a function of the inputs u for each parameter vector x.

For each integer k, a k-tuple of the form u⃗ = (u1, . . . , uk) ∈ (R^m)^k = R^{mk} will be called an (ordered) sample of length k. We want to measure how rich the set of functions {β(x, ·), x ∈ R^n} is for binary classification purposes. Consider a sample u⃗ = (u1, . . . , uk) of length k. A sequence ε⃗ = (ε1, . . . , εk) ∈ {−1, 1}^k will be said to be assignable to u⃗ (with respect to the given response function) if there exists some parameter vector x so that sign(β(x, ui)) = εi, i = 1, . . . , k, where we define sign(y) = 1 if y > 0 and −1 otherwise. We will say here that a sample u⃗ of length k is ±-shattered (by β) if, for each ε⃗ ∈ {−1, 1}^k, either ε⃗ or −ε⃗ is assignable to u⃗. For each k, let Sk be the subset of (R^m)^k = R^{mk}, possibly empty, consisting of those samples that are ±-shattered. We define

µ = µβ := sup { k ≥ 1 | Sk is a dense subset of R^{mk} } .

Our goal is to show that under reasonable assumptions on β, necessarily µ ≤ 2n + 1.

The organization of this article is as follows. Section 2 discusses the precise statement of the result, for functions β that are analytic and "definable," as well as the fact that this bound is optimal. Section 3 provides a proof of the result, which is based on material on analytic functions and especially on new results from the theory of exponentials, which are reviewed briefly in the Appendix. Section 4 shows how to generalize the result to the piecewise-analytic definable case, which allows the consideration of neural networks with discontinuous activation functions.

2 Precise Statements and Remarks

Note that 2n + 1 is the best possible upper bound, assuming that one wants to prove a result that includes at least all polynomially parameterized classes. To show this, we must exhibit for each integer n > 0 a polynomial response function β : R^n × R^{m_n} → R with the property that S_{2n+1} is dense in R^{2n+1}. In fact, we give, for each n, a β with m_n = 1 so that every sample of length 2n + 1 consisting of distinct vectors can be ±-shattered:

β((x1, . . . , xn), u) := (u − x1)^2 (u − x2)^2 · · · (u − xn)^2 .

Take any u⃗ ∈ R^{2n+1} with all ui distinct. The claim is that this is ±-shattered. Indeed, pick any ε⃗ = (ε1, . . . , ε_{2n+1}) ∈ {−1, 1}^{2n+1}. Without loss of generality, we assume that there are s ≤ n elements "−1" in the sequence ε⃗ (otherwise, we assign −ε⃗ instead) and that these are ε_{i1}, . . . , ε_{is}. Let xj := u_{ij} for j = 1, . . . , s, and pick x_{s+1}, . . . , xn to be any elements not in the set {u1, . . . , u_{2n+1}}. We have that β(x, u) = 0 for each u = u_{ij}, j = 1, . . . , s, and β(x, u) > 0 for all other u ∈ R. Thus sign(β(x, ui)) = εi for all i.

The question we address next is what are "reasonable" assumptions on the class β so that an upper bound like 2n + 1 holds. It is not difficult to see that merely asking β to be an analytic function is not sufficient even to guarantee finiteness. This is illustrated trivially by the example β(x, u) = sin(xu), which has n = m = 1 but for which µβ = +∞ (this is a consequence of the fact that if the numbers u1, . . . , uk, π are linearly independent over the rationals, then the set of vectors (sin(ℓu1), . . . , sin(ℓuk)), as ℓ ranges over the positive integers, is a dense subset of [−1, 1]^k). In fact, it is possible to define a fixed analytic function β, with m = 1, for which Sk consists of all sequences of k distinct elements of R, not merely a dense subset of R^{km}, for all k (see Sontag 1992). Thus, analyticity is in itself not enough to obtain nontrivial bounds.

We will assume that β is analytic and definable. The Appendix recalls the definition and basic facts about ("exp-RA") definable functions. Informally, these are functions that can be defined in terms of any first-order logic sentence built out of the standard propositional connectives, existential and universal quantification, and which involve rational operations and exponentiation. (Also allowed in the formulas are certain restricted analytic functions, for instance, arctan(x), but sin(x) is not included.) In particular (and this is of interest in the context of applications to "artificial neural networks"), any response of a "multilayer sigmoidal network" with "activation function" 1/(1 + e^{−x}) is definable in this sense, because it is obtained by iterative compositions and polynomial combinations involving the activation function, which in turn is constructed by means of rational operations involving exponentiation.
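Before turning to the main theorem, the polynomial construction above is easy to check mechanically. The following Python sketch is ours (the choice n = 3 and the sample values are arbitrary); it verifies that for every sign sequence, either it or its negation is realized by placing a root of β at each point that must receive sign −1 and parking the spare roots away from the sample:

```python
import itertools

def beta(roots, u):
    prod = 1.0
    for x in roots:
        prod *= (u - x) ** 2
    return prod

sign = lambda y: 1 if y > 0 else -1   # the convention used in the text

n = 3
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # 2n + 1 distinct inputs

def assignable(eps):
    neg = [u for u, e in zip(points, eps) if e == -1]
    if len(neg) > n:
        return False
    # one root at each point needing sign -1; spare roots far from the sample
    roots = neg + [max(points) + j + 1.0 for j in range(n - len(neg))]
    return all(sign(beta(roots, u)) == e for u, e in zip(points, eps))

assert all(assignable(eps) or assignable(tuple(-e for e in eps))
           for eps in itertools.product((-1, 1), repeat=2 * n + 1))
print("every sign pattern is realized, up to a global flip")
```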
Theorem 1. If β : R^n × R^m → R is an analytic definable response, then µβ ≤ 2n + 1.
This result is proved in the next section.

Remark 2.1. In computational learning theory, one defines shattering of a sample u⃗ by the requirement that each ε⃗ be assignable, and studies the set S_k^+ consisting of all those samples that are shattered. For our results, it seems more natural to look at ±-shattering because we then obtain a best-possible bound. In any case, the two concepts are almost identical. Since S_k^+ ⊆ Sk for all k, the upper bound 2n + 1 will also apply in particular to sup { k ≥ 1 | S_k^+ is a dense subset of R^{km} }. (Conversely, if u⃗ can be ±-shattered, then the new response function with n + 1 parameters, defined for parameter vectors (x0, x1, . . . , xn) ∈ R^{n+1} by β0((x0, x), u) := 1 − x0 β(x, u), shatters the same sample u⃗.) In learning theory and statistical estimation, the Vapnik-Chervonenkis (VC) dimension is defined as the supremum of the integers k such that S_k^+ is nonempty; here we are asking questions regarding the shattering of arbitrary sets of points in "general position," a concept that appears in Cover's and in other classical pattern recognition work. For the best-known (obviously larger) upper bounds for the VC dimension, for closely related classes of responses, see Karpinski and Macintyre (in press), and see also Maass (1994) and Koiran and Sontag (1996) for superlinear lower bounds.

3 Proof of Main Result

We need to show that if k ≥ 2n + 2, then Sk is not dense; that is, there is some nonempty open subset Q ⊆ R^{mk} so that Q ∩ Sk = ∅. It will be enough to show this when k = 2n + 2. In fact, we will establish a much stronger fact: for each nonempty open subset W0 ⊆ R^{m(n+1)}, there is a nonempty open subset Q ⊆ W0^2 such that Q ∩ S_{2n+2} = ∅.

For any integer ℓ, we introduce the following subset of R^{mℓ}:

A_ℓ := { (u1, . . . , uℓ) | ∃ x s.t. β(x, ·) ≢ 0, β(x, u1) = · · · = β(x, uℓ) = 0 } .

By corollary A.1, it follows that A_ℓ has dimension strictly less than mℓ whenever ℓ ≥ n + 1. The content of that lemma is totally intuitive: for each fixed x, except those for which the map β(x, u) is identically zero, the set of u1's so that β(x, u1) = 0 is of dimension at most m − 1 (since this set is the set of zeros of a nontrivial analytic function); and similarly for each of u2, . . . , uℓ. Thus, for each such x, the Cartesian product of these sets, namely, the set of vectors (u1, . . . , uℓ) so that β(x, u1) = · · · = β(x, uℓ) = 0, has dimension at most ℓ(m − 1). If we now let x range over the whole parameter space R^n (but not including those x for which β(x, ·) ≡ 0), the set A_ℓ is obtained as a union of an n-parameter family of sets, each of which is of dimension ≤ ℓ(m − 1), from which it follows that A_ℓ has dimension at most
n + ℓ(m − 1), which is less than mℓ provided ℓ ≥ n + 1. The technicalities have to do only with defining precisely the meaning of "dimension" (this will mean the maximum possible dimension of a submanifold of the set) and establishing the obvious formulas, such as the fact that the union of an n-parameter family of sets of dimension ≤ s has dimension ≤ n + s.

Let W0 be any open subset of R^{m(n+1)}. The fact that β is definable implies, by lemma A.2, that there are only two possibilities for each set A_ℓ: either it contains some open set, or it is nowhere dense. Since when ℓ = n + 1 this set does not have full dimension, as shown in the previous paragraph, the first alternative cannot hold, so it must be nowhere dense. That is, its closure has an empty interior, which implies that W0 \ clos A_{n+1} is a nonempty open set. Thus there is some open subset V of W0 ⊆ R^{m(n+1)} of the special product form V = V1 × · · · × V_{n+1}, for some nonempty connected open subsets V1, . . . , V_{n+1} of R^m, so that V ∩ A_{n+1} = ∅. Now let Q ⊆ W0^2 ⊆ R^{2m(n+1)} be the open set defined as follows:

Q := V × V .

The claim is that Q ∩ S_{2n+2} = ∅. To establish this fact, we take an arbitrary element

(u1, . . . , u_{n+1}, v1, . . . , v_{n+1})

of Q and show that it cannot be ±-shattered; that is, there is some sequence ε⃗ = (ε1, . . . , ε_{2n+2}) ∈ {−1, 1}^{2n+2} for which neither ε⃗ nor −ε⃗ is assignable. Consider the following sequence:

ε⃗ := (1, . . . , 1, −1, . . . , −1) ,

with n + 1 entries equal to 1 followed by n + 1 entries equal to −1.
Assume that there would exist some parameter vector x so that this vector is assigned:

β(x, ui) > 0, i = 1, . . . , n + 1,
β(x, vi) ≤ 0, i = 1, . . . , n + 1.

Observe that, in particular, it holds that β(x, ·) ≢ 0 for this x. For each i = 1, . . . , n + 1, let γi : [0, 1] → Vi ⊆ R^m be a continuous function so that γi(0) = ui and γi(1) = vi. (Recall that Vi is connected.) Since β(x, γi(t)) is a continuous function of t, we conclude that for each i there is some ti so that β(x, γi(ti)) = 0. Writing wi := γi(ti), this says that w⃗ := (w1, . . . , w_{n+1}) ∈ A_{n+1} and that w⃗ ∈ V1 × · · · × V_{n+1} = V, which contradicts the fact that V was picked so that it is disjoint from A_{n+1}. Thus ε⃗ is not assignable. If −ε⃗ were assignable, we would
have an x such that β(x, ui) ≤ 0 for i = 1, . . . , n + 1 and β(x, vi) > 0 for i = 1, . . . , n + 1, and the same argument leads once more to a contradiction.

Remark 3.1. Observe that the definability property is used when proving that A_{n+1} cannot be dense. If instead one would pick, for instance, β(x, u) = sin(xu), the property is false. For this example, A2 is the union of all the lines in R^2 that pass through (0, 0) and have a rational slope (plus the y-axis), so A2 in this case is a dense set with an empty interior.

Remark 3.2. As remarked during the proof, a stronger result is established. In particular, for each nonempty open subset W ⊆ R^m, there is a nonempty open subset Q ⊆ W^{2n+2} such that Q ∩ S_{2n+2} = ∅. (Apply with W0 := W^{n+1}.)

4 Discontinuous Responses

The main theorem requires that the response function β be analytic as well as definable. It is possible, however, to extend its validity to some other classes of responses, including those obtained by considering neural networks that use "heaviside" activation functions

H(v) = 0 if v ≤ 0,   H(v) = 1 if v > 0,

as well as sigmoidal activations. Such responses are a particular case of a piecewise-analytic definable response, meaning a response that is described by a (possibly different) definable analytic function in each piece of a partition of the space of parameters and inputs. The partition in turn is assumed to be determined by the signs of a finite number of definable analytic functions. We now make this concept precise.

The map β : R^n × R^m → R is a piecewise-analytic definable response if there is some integer p and there exist p + 2^p analytic definable functions

φ1, . . . , φp : R^n × R^m → R

and

{ ψα : R^n × R^m → R, α ∈ {0, 1}^p }

such that the following property holds: for each (x, u) ∈ R^n × R^m, letting

α(x, u) := ( H(φ1(x, u)), . . . , H(φp(x, u)) ) ,

then β(x, u) = ψ_{α(x,u)}(x, u). (That is, in each of the components of the partition of the (x, u) space determined by the possible signs of the functions φi, β is expressed by an appropriate ψα.)
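As a concrete, hypothetical instance of this definition, a single hard-threshold unit with an affine argument fits the template with p = 1: one partition function φ1 and one (here constant) analytic piece per index α. A minimal Python sketch of ours:

```python
def H(v):                        # the "heaviside" activation from the text
    return 1 if v > 0 else 0

def phi1(x, u):                  # the single partition function (p = 1),
    return x[0] * u + x[1]       # affine in u, as for a threshold unit

psi = {(0,): lambda x, u: -0.5,  # one definable analytic piece per alpha;
       (1,): lambda x, u: 0.5}   # constants here, for simplicity

def beta(x, u):
    alpha = (H(phi1(x, u)),)     # alpha(x, u), as in the definition
    return psi[alpha](x, u)

print(beta((1.0, -2.0), 3.0), beta((1.0, -2.0), 1.0))   # 0.5 -0.5
```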
Corollary 4.1. If β is piecewise analytic definable, then µβ ≤ 2n + 3.

The proof will be based on the construction of an analytic definable response β′ : R^{n+1} × R^m → R with the following property: if a given sequence ε⃗ = (ε1, . . . , εk) ∈ {−1, 1}^k is assignable to a sample u⃗ = (u1, . . . , uk) with respect to the response β, then it is also assignable to the same sample with respect to the response β′. This means that if a sample u⃗ is ±-shattered with respect to β, then it is also ±-shattered with respect to β′, so we have µβ ≤ µ_{β′}. By the main theorem we know that µ_{β′} ≤ 2(n + 1) + 1, so the desired conclusion follows.

We now show how to construct β′. As a preliminary step, we establish a simple fact regarding the approximation of piecewise constant functions by sigmoidal combinations. We first introduce some additional notation. Given any vector r = (r1, . . . , rp) ∈ R^p, we let γ(r) := (H(r1), . . . , H(rp)). Consider the function

θ : R^p × R^{2^p} → R : (r, s) ↦ s_{γ(r)}

(where we are indexing the coordinates of s ∈ R^{2^p} by vectors α = (α1, . . . , αp) ∈ {0, 1}^p). Now let σ : R → R be the function σ(v) = (1 + e^{−v})^{−1} and, denoting σ1 := σ, σ0 := 1 − σ, consider, for each positive real number ρ, the (definable and analytic) function

χρ : R^p × R^{2^p} → R : (r, s) ↦ Σ_{α ∈ {0,1}^p} σ_{α1}(ρ^2 r1 − ρ) · · · σ_{αp}(ρ^2 rp − ρ) s_α .   (4.1)
Lemma 4.1. For each pair (r, s) ∈ R^p × R^{2^p},

lim_{ρ→∞} χρ(r, s) = θ(r, s)   (4.2)

and

θ(r, s) = 0 ⇒ lim_{ρ→∞} (1 + ρ^2) χρ(r, s) = 0 .   (4.3)
Proof. Observe that we can rewrite θ as follows:

θ(r, s) = Σ_{α ∈ {0,1}^p} H_{α1}(r1) · · · H_{αp}(rp) s_α ,

where H1 := H, H0 := 1 − H. Thus the first conclusion of the lemma is an immediate consequence of the fact that lim_{ρ→∞} σ(ρ^2 r − ρ) = H(r) for each
r ∈ R. Now assume that (r, s) is such that θ(r, s) = s_{γ(r)} = 0. This means that the term corresponding to the index α = γ(r) does not appear in the sum (cf. equation 4.1) defining χρ(r, s). Fix any other index α; we claim that

lim_{ρ→∞} (1 + ρ^2) σ_{α1}(ρ^2 r1 − ρ) · · · σ_{αp}(ρ^2 rp − ρ) = 0
(from which the second conclusion will follow). Indeed, since α ≠ γ(r), there must exist some j ∈ {1, . . . , p} for which αj ≠ H(rj); fix one such j. If αj = 1, then H(rj) = 0 means that rj ≤ 0, and therefore (1 + ρ^2) σ(ρ^2 rj − ρ) → 0 as ρ → ∞; since the other terms are bounded, it follows that their product converges to zero. If instead αj = 0, then H(rj) = 1 means that rj > 0, and thus (1 + ρ^2)(1 − σ(ρ^2 rj − ρ)) → 0 as ρ → ∞; again since the other terms are bounded, the product goes to zero.

Equation 4.3 implies in particular that when θ(r, s) = 0,

χρ(r, s) − 1/(1 + ρ^2) < 0

for all ρ large enough. Together with equation 4.2, this means that

sign ( χρ(r, s) − 1/(1 + ρ^2) ) = sign θ(r, s)
as ρ → ∞, for each (r, s) ∈ R^p × R^{2^p}.

We now prove corollary 4.1. Let β, φ1, . . . , φp, and the ψα's be as in the definition of piecewise-analytic definable response. Denote Φ(x, u) := (φ1, . . . , φp) and let Ψ(x, u) ∈ R^{2^p} be the vector whose αth component is ψα(x, u). Then β(x, u) = θ(Φ(x, u), Ψ(x, u)). We define

β′((ρ, x), u) := χρ(Φ(x, u), Ψ(x, u)) − 1/(1 + ρ^2) .
Thus sign β′((ρ, x), u) = sign β(x, u) as ρ → ∞, for each (x, u) ∈ R^n × R^m. Now assume that the sequence ε⃗ = (ε1, . . . , εk) ∈ {−1, 1}^k is assignable to the sample u⃗ = (u1, . . . , uk) with respect to the response β. This means that there is some parameter vector x ∈ R^n so that sign(β(x, ui)) = εi for each i = 1, . . . , k. For large enough ρ, then, also sign(β′((ρ, x), ui)) = εi for all i, so we conclude that ε⃗ is also assignable to the same sample with respect to the response β′.

Remark 4.1. The bound 2n + 3 can be reduced to 2n + 1 provided that the classes of functions φi(x, ·) and ψα(x, ·) are closed under scaling and translation (as is the case commonly when dealing with artificial neural networks, where these are affine functions and the parameters x correspond to the coefficients of linear combinations [weights] and constant term [bias]). In that case, the parameter ρ can be absorbed into the remaining parameters.
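A short numerical check of lemma 4.1 and of the sign relation used in the proof can be written in a few lines of Python; the choice p = 2, the vector r, and the table s below are ours and arbitrary (s is chosen so that θ(r, s) = 0), and, as the text states, the agreement in sign sets in only for sufficiently large ρ:

```python
import itertools, math

H = lambda v: 1 if v > 0 else 0
sigma = lambda v: 1.0 / (1.0 + math.exp(-v))
sign = lambda y: 1 if y > 0 else -1

def theta(r, s):
    return s[tuple(H(ri) for ri in r)]

def chi(rho, r, s):                       # equation 4.1
    total = 0.0
    for alpha in itertools.product((0, 1), repeat=len(r)):
        term = s[alpha]
        for a, ri in zip(alpha, r):
            f = sigma(rho * rho * ri - rho)
            term *= f if a == 1 else 1.0 - f
        total += term
    return total

r = (0.3, -0.7)
s = {(0, 0): 1.0, (0, 1): -2.0, (1, 0): 0.0, (1, 1): 3.0}  # theta(r, s) = 0
for rho in (2.0, 5.0, 20.0):
    c = chi(rho, r, s)
    print(rho, c, sign(c - 1.0 / (1.0 + rho * rho)), sign(theta(r, s)))
```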
Appendix

We first review some elementary facts regarding real-analytic geometry. If M is any (analytic, second-countable) manifold, by a σ-analytic subset Z of M we mean one that can be decomposed into a countable union of embedded analytic submanifolds; for such a set Z, we define dim Z as the largest dimension of a submanifold appearing in such a union. (The dimension is well defined, in the sense that it does not depend on the particular decomposition.) Consider now any analytic map π : M → N between two manifolds. For each y ∈ N, the "fiber" M_y = {x ∈ M | π(x) = y} provides an example of a σ-analytic subset (and if M is connected, either M_y = M or dim M_y < dim M). Now take any σ-analytic subset Z of M. Then the image π(Z) is a σ-analytic subset of N, having dim π(Z) ≤ dim Z, and the following inequality holds:

dim Z ≤ dim π(Z) + max_{y ∈ π(Z)} dim ( M_y ∩ Z ) .   (A.1)
(This inequality is a simple consequence of stratification theory; it is proved in the appendix of Sontag 1996. Note, incidentally, that a strict inequality may hold, as illustrated by the case M = R^2, N = R, π = projection on the first factor, and Z = the union of the x and y axes.)

We prove the following fact, which is basically theorem 1 in Sontag (1996). (That theorem could also be applied more directly, but the slight generalization given here should be useful for other purposes.)

Lemma A.1. Let X, U1, . . . , Uℓ be (analytic, second countable) manifolds, with dimensions n, m1, . . . , mℓ respectively, and each Ui connected. Assume given ℓ analytic maps

βi : X × U1 × · · · × Ui → R, i = 1, . . . , ℓ,

so that the following property holds for each i = 1, . . . , ℓ:

∀ x ∈ X, ∀ u1 ∈ U1, . . . , ∀ u_{i−1} ∈ U_{i−1}, ∃ u ∈ Ui so that βi(x, u1, . . . , u_{i−1}, u) ≠ 0 .   (A.2)

Let

Gℓ := { (x, u1, . . . , uℓ) ∈ X × U1 × · · · × Uℓ | β1(x, u1) = · · · = βℓ(x, u1, . . . , uℓ) = 0 } .
Then this is a σ-analytic set with

dim Gℓ ≤ n + Σ_{j=1}^{ℓ} mj − ℓ .
Proof. Define

Gi := { (x, u1, . . . , ui) ∈ X × U1 × · · · × Ui | β1(x, u1) = · · · = βi(x, u1, . . . , ui) = 0 }

for each i = 1, . . . , ℓ, and G0 := X; we prove by induction on i that dim Gi ≤ n + Σ_{j=1}^{i} mj − i. For i = 0 the result holds, so we now assume it has been proved for i and show the case i + 1. Let π : X × U1 × · · · × U_{i+1} → X × U1 × · · · × Ui be the projection on the first 1 + i factors. Write Z := G_{i+1}. Note that π(Z) ⊆ Gi by definition of these sets, so in particular it holds by inductive hypothesis that

dim π(Z) ≤ n + Σ_{j=1}^{i} mj − i .   (A.3)

For each fixed y = (x, u1, . . . , ui) ∈ π(Z), π^{−1}(y) ∩ Z = {x} × {u1} × · · · × {ui} × Q, where

Q_{(x,u1,...,ui)} := { u ∈ U_{i+1} | β_{i+1}(x, u1, . . . , ui, u) = 0 } .

Property A.2 implies that dim Q_{(x,u1,...,ui)} ≤ m_{i+1} − 1. Thus from equation A.1 and using equation A.3, we conclude that

dim Z ≤ n + Σ_{j=1}^{i} mj − i + m_{i+1} − 1 = n + Σ_{j=1}^{i+1} mj − (i + 1) ,

and the induction is complete.

Corollary A.1. Let β : R^n × R^m → R be analytic, and consider for any integer ℓ the set

A_ℓ := { (u1, . . . , uℓ) | ∃ x s.t. β(x, ·) ≢ 0, β(x, u1) = · · · = β(x, uℓ) = 0 } .

Then this is a σ-analytic set of dimension at most n + ℓ(m − 1).

Proof. Let X = {x ∈ R^n | β(x, ·) ≢ 0}; this is an open subset of R^n and hence a submanifold. (If X = ∅, there is nothing to prove.) Apply lemma A.1 with βi(x, u1, . . . , ui) := β(x, ui) (and all Ui = U); observe that property A.2 holds by definition of X. Then dim Gℓ ≤ n + ℓ(m − 1), and the same bound applies to its projection A_ℓ on the U components.
Finally, we review several facts about exp-RA definable functions. This discussion is based on recent work in model theory due to Gabrielov, Van den Dries, Wilkie, and many others; see Macintyre and Sontag (1993) and Sontag (1996) for more details and extensive bibliographical references.

A restricted analytic (RA) function R^q → R is any function that is real analytic on some compact subset C of R^q and is zero outside C (examples: the restricted versions of sin and cos, which equal sin and cos on [−π/2, π/2] and are zero outside this interval). Formally, consider the structure

L = (R, +, ·, <, 0, 1, exp, { f, f ∈ RA }) ,

and the corresponding language for the real numbers with addition, multiplication, and order, as well as one function symbol for real exponentiation and one for each restricted analytic function. An (exp-RA) definable set is any subset of R^k defined by a first-order formula over L with k free variables. A definable function is a function β : M → N whose graph is a definable set in this sense, and where N and M are definable subsets of two spaces R^{l1} and R^{l2}, respectively. For example, any function obtained by compositions of rational operations and taking exponentials is definable, such as 1/(1 + e^{−x}); also definable is the function arctan(x), because its graph is characterized by the formula "y = arctan(x) iff −π/2 < y < π/2 and sin(y) = x cos(y)." Notice that any set obtained from a function β by logical operations (such as the set A_ℓ in corollary A.1) is definable provided that β is definable.

The following fact is a nontrivial consequence of the work in logic cited above; see precise references in Sontag (1996):

Lemma A.2. Let S be a definable subset of R^q for some q. Then either S contains an open subset, or it is a finite union of connected embedded submanifolds of R^q (and in particular is nowhere dense).

Acknowledgments

This work was supported in part by US Air Force grant AFOSR-94-0293.

References

Cover, T. M. 1988. Capacity problems for linear machines. In Pattern Recognition, L. Kanal, ed., pp. 283–289. Thompson Book Co.
Karpinski, M., and Macintyre, A. In press. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. J. Computer Systems Sciences.
Koiran, P., and Sontag, E. D. In press. Neural networks with quadratic VC dimension. J. Computer Systems Sciences.
Kowalczyk, A. In press. Estimates of storage capacity of multilayer perceptron with threshold logic units. Neural Networks.
Maass, W. 1994. Perspectives of current research about the complexity of learning in neural nets. In Theoretical Advances in Neural Computation and Learning,
V. P. Roychowdhury, K. Y. Siu, and A. Orlitsky, eds., pp. 295–336. Kluwer, Boston.
Maass, W. 1996. Lower bounds for the computational power of networks of spiking neurons. Neural Computation 8, 1–40.
Maass, W. In press. The third generation of neural network models. In Proc. 7th Australian Conference on Neural Networks 1996, Canberra, Australia.
Macintyre, A., and Sontag, E. D. 1993. Finiteness results for sigmoidal "neural" networks. In Proc. 25th Annual Symp. Theory of Computing, pp. 325–334, San Diego, CA.
Sontag, E. D. 1992. Feedforward nets for interpolation and classification. J. Computer Systems Sciences 45, 20–48.
Sontag, E. D. 1996. Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal nets. Advances in Computational Mathematics 5, 245–268.
Received March 18, 1996; accepted May 21, 1996.
Communicated by Shun-ichi Amari
Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions Vijay Balasubramanian Department of Physics, Princeton University, Princeton, NJ 08544 USA
The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of low-temperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I arrive at a precise understanding of how Occam's razor, the principle that simpler models should be preferred until the data justify more complex models, is automatically embodied by probability theory. These results require a measure on the space of model parameters, and I derive and discuss an interpretation of Jeffreys' prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution that I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.

Neural Computation 9, 349–368 (1997)
© 1997 Massachusetts Institute of Technology

1 Introduction

In recent years increasingly precise experiments have directed the interest of biophysicists toward learning in simple neural systems. The typical context of such learning involves an estimation of some behaviorally relevant information from a statistically varying environment. For example, the experiments of de Ruyter and collaborators have provided detailed measurements of the adaptive encoding of wide-field horizontal motion by the H1 neuron of the blowfly (de Ruyter et al. 1995). Under many circumstances the associated problems of statistical estimation can be fruitfully cast in the language of statistical mechanics, and the powerful techniques developed in that discipline can be brought to bear on questions regarding learning (Potters and Bialek 1994).

This article concerns a problem that arises frequently in the context of biophysical and computational learning: the estimation of parametric models of some true distribution t based on a collection of data drawn from t. If we are given a particular family of parametric models (gaussians, for example), the task of modeling t is reduced to parameter estimation, which is a relatively well-understood, though difficult, problem. Much less is known about the task of model family selection. For example, how do we choose between a family of gaussians and a family of 50 exponentials as a model for t based on the available data? In this article we will be concerned with the latter problem, on which considerable ink has already been expended in the literature (Rissanen 1984, 1986; Barron 1985; Clarke and Barron 1990; Barron and Cover 1991; Wallace and Freeman 1987; Yamanishi 1995; MacKay 1992a, 1992b; Moody 1992; Murata et al. 1994).

The first contribution is to cast Bayesian model family selection more clearly as a statistical mechanics on the space of probability distributions, in the hope of making this important problem more accessible to physicists. In this language, a finite-dimensional parametric model family is viewed as a manifold embedded in the space of probability distributions. The probability of the model family given the data can be identified with a partition function associated with a particular energy functional. The formalism bears a resemblance to the description of a disordered system in which the number of data points plays the role of the inverse temperature and in which the data play the role of the disordering medium. Exploiting the techniques of low-temperature expansions in statistical mechanics, it is easy to extend existing results that use gaussian approximations to the Bayesian posterior probability of a model family to find "Occam factors" penalizing complex models (Clarke and Barron 1990; MacKay 1992a, 1992b). We find a systematic expansion in powers of 1/N, where N is the number of data points, and identify terms that encode accuracy, model dimensionality, and robustness, as well as higher-order measures of simplicity. The subleading terms, which can be important when the number of data points is small, represent a limited attempt to move analysis of Bayesian statistics away from asymptotics toward the regime of small N that is often biologically relevant. The results presented here do not require the true distribution to be a member of the parametric family under consideration, and the model degeneracies that can threaten analysis in such cases are dealt with by the method of collective coordinates from statistical mechanics. Some connections with the minimum description length principle and stochastic complexity are mentioned (Barron and Cover 1991; Clarke and Barron 1990; Rissanen 1984, 1986; Wallace and Freeman 1987). The relationship to the network information criterion of Murata et al. (1994) and the "effective dimension" of Moody (1992) is discussed.

In order to perform Bayesian model selection, it is necessary to have a prior distribution on the space of parameters of a model. Equivalently, we require the correct measure on the phase space defined by the parameter manifold in the analog statistical mechanical problem considered in this article. In the absence of well-founded reasons to pick a particular prior distribution, the usual prescription is to pick an unbiased prior density that weights all parameters equally. However, this prescription is not invariant under reparametrization, and I will argue that the correct prior should give equal weight to all distributions indexed by the parameters. Requiring
all distributions to be a priori equally likely yields Jeffreys' prior on the parameter manifold, giving a new interpretation of this choice of prior density (Jeffreys 1961). Finally, consideration of the large N limit of the asymptotic expansion of the Bayesian posterior probability leads us to define the razor of a model: a theoretical index of the complexity of a parametric family relative to a true distribution. In statistical mechanical terms, the razor is the quenched approximation of the disordered system studied in Bayesian statistics. Analysis of the razor using the techniques of statistical mechanics can give insights into the types of phenomena that can be expected in systems that perform Bayesian statistical inference. These phenomena include phase transitions in learning and adaptation to changing environments. In view of the length of this article, applications of the general framework developed here to specific models relevant to biophysics will be left to future publications.

2 Statistical Inference and Statistical Mechanics

Suppose we are given a collection of outcomes E = {e1, . . . , eN}, ei ∈ X, drawn independently from a density t. Suppose also that we are given two parametric families of distributions A and B, and we wish to pick one of them as the model family that we will use. The Bayesian approach to this problem consists of computing the posterior conditional probabilities Pr(A|E) and Pr(B|E) and picking the family with the higher probability. Let A be parameterized by a set of real parameters Θ = {θ1, . . . , θd}. Then Bayes' rule tells us that

Pr(A|E) = ( Pr(A) / Pr(E) ) ∫ d^d Θ w(Θ) Pr(E|Θ) .   (2.1)
In this expression Pr(A) is the prior probability of the model family, w(Θ) is a prior density on the parameter space, and Pr(E) = Pr(A) Pr(E|A) + Pr(B) Pr(E|B) is a density on the N-outcome sample space. The measure induced by the parameterization of the d-dimensional parameter manifold is denoted d^d Θ. Since we are interested in comparing Pr(A|E) with Pr(B|E), the density Pr(E) is a common factor that we may omit, and for lack of any better choice we take the prior probabilities of A and B to be equal and omit them.¹ Finally, throughout this article we will assume that the model families of interest have compact parameter spaces. This condition is easily
¹ Bayesian inference is sometimes carried out by looking for the best-fit parameter that maximizes w(Θ) Pr(E|Θ) rather than by averaging over the Θ as in equation 2.1. However, the average over model parameters is important because it encodes Occam's razor, or a preference for simple models. Also see Balasubramanian (1996).
relaxed by placing regulators on noncompact parameter spaces, but we will not concern ourselves with this detail here. 2.1 Natural Priors or Measures on Phase Space. In order to make further progress, we must identify the prior density w(2). In the absence of a well-motivated prior, a common prescription is to use the uniform distribution on the parameter space since this is deemed to reflect complete ignorance (MacKay 1992a). In fact, this choice suffers from the serious deficiency that the uniform priors relative to different parameterizations can assign different probability masses to the same subset of parameters (Jeffreys 1961; Lee 1989). Consequently, if w(2) was uniform in the parameters, the probability of a model family would depend on the arbitrary parameterization. The problem can be solved by making the much more reasonable requirement that all distributions rather than all parameters are equally likely.2 In order to implement this requirement, we should give equal weight to all distinguishable distributions on a model manifold. However, nearby parameters index very similar distributions. So let us ask: How do we count the number of distinct distributions in the neighborhood of a point on a parameter manifold? Essentially this is a question about the embedding of the parameter manifold in the space of distributions. Points that are distinguishable as elements of Rn may be mapped to indistinguishable points (in some suitable sense) of the embedding space. To answer the question, let 2p and 2q index two distributions in a parametric family, and let E = {e1 , . . . , eN } be drawn independently from one of 2p or 2q . In the context of model estimation, a suitable measure of distinguishability can be derived by asking how well we can guess which of 2p or 2q produced E. Let αN be the probability that 2q is mistaken for ² be the 2p , and let βN be the probability that 2p is mistaken for 2q . Let βN smallest possible βN given that αN < ². Then Stein’sR lemma tells us that ² = D(2p k2q ), where D(pkq) = dx p(x) ln(p(x)/q(x)) limN→∞ (−1/N) ln βN is the relative entropy between the densities p and q (Cover 1991). As shown in the Appendix, the proof of Stein’s lemma shows that the ² exceeds a fixed β ∗ in the region where κ/N ≥ D(2p k2q ) minimum error βN ∗ with κ ≡ − ln β + ln(1 − ²).3 By taking β ∗ close to 1 we can identify the region around 2p where the distributions are not very distinguishable from the one indexed by 2p . As N grows large for fixed κ, any 2q in this region is necessarily close to 2p since D(2p k2q ) attains a minimum of zero when 2p = 2q . Therefore, setting 12 = 2q − 2p , Taylor expansion gives D(2p k2q ) ≈ (1/2)Jij (2p )12i 12 j + O(123 ) where
2
This applies the principle of maximum entropy on the invariant space of distributions rather than the arbitrary space of parameters. 3 This assertion is not strictly true. See the Appendix for more details.
Statistical Inference, Occam’s Razor, and Statistical Mechanics
353
J_ij = ∇_{φ_i} ∇_{φ_j} D(Θ_p‖Θ_p + Φ)|_{Φ=0} is the Fisher information.⁴ (We use the convention that repeated indices are summed over.)

In a certain sense, the relative entropy D(Θ_p‖Θ_q) appearing in this problem is the natural distance between probability distributions in the context of model selection. Although it does not itself define a metric, the Taylor expansion locally yields a quadratic form with the Fisher information acting as the metric. If we accept J_ij as the natural metric, differential geometry immediately tells us that the reparameterization-invariant measure on the parameter manifold is d^dΘ √(det J) (Amari 1985; Amari et al. 1987). Normalizing this measure by dividing by ∫ d^dΘ √(det J) gives the so-called Jeffreys' prior on the parameters.

A more satisfying explanation of the choice of prior proceeds by directly counting the number of distinguishable distributions in the neighborhood of a point on a parameter manifold. Define the volume of indistinguishability at levels ε, β*, and N to be the volume of the region around Θ_p where κ/N ≥ D(Θ_p‖Θ_q), so that the probability of error in distinguishing Θ_q from Θ_p is high. We find to leading order:

    V_{\epsilon,\beta^*,N} = \left( \frac{2\pi\kappa}{N} \right)^{d/2} \frac{1}{\Gamma(d/2+1)} \frac{1}{\sqrt{\det J_{ij}(\Theta_p)}} .    (2.2)
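To make equation 2.2 concrete, here is a minimal numerical sketch (mine, not the author's) for a one-dimensional gaussian location family with unit variance, where J(θ) = 1 and D(θ_p‖θ_q) = (θ_q − θ_p)²/2; in this case the leading-order volume can be checked against the exact length of the interval defined by κ/N ≥ D.

```python
# Sketch (illustrative only): check equation 2.2 for a 1-D gaussian
# location family with unit variance, where J = 1 and
# D(theta_p || theta_q) = (theta_q - theta_p)^2 / 2.
import numpy as np
from scipy.special import gamma

def volume_of_indistinguishability(kappa, N, d=1, det_J=1.0):
    """Leading-order volume from equation 2.2."""
    return (2 * np.pi * kappa / N) ** (d / 2) / (gamma(d / 2 + 1) * np.sqrt(det_J))

beta_star, eps = 0.9, 0.01
kappa = -np.log(beta_star) + np.log(1 - eps)   # kappa = -ln(beta*) + ln(1 - eps)

for N in [100, 1000, 10000]:
    v_formula = volume_of_indistinguishability(kappa, N)
    # Exact region {dtheta : dtheta^2 / 2 <= kappa / N} is an interval of
    # length 2 * sqrt(2 * kappa / N), which equation 2.2 reproduces for d = 1.
    v_exact = 2 * np.sqrt(2 * kappa / N)
    print(N, v_formula, v_exact)
```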
If β* is very close to one, the distributions inside V_{ε,β*,N} are not very distinguishable, and the Bayesian prior should not treat them as separate distributions. We wish to construct a measure on the parameter manifold that reflects this indistinguishability. We also assume a principle of translation invariance by supposing that volumes of indistinguishability at given values of N, β*, and ε should have the same measure regardless of where in the space of distributions they are centered.

An integration measure reflecting these principles of indistinguishability and translation invariance can be defined at each level β*, ε, and N by covering the parameter manifold economically with volumes of indistinguishability and placing a delta function in the center of each element of the cover. This definition reflects indistinguishability by ignoring variations on a scale smaller than the covering volumes and reflects translation invariance by giving each covering volume equal weight in integrals over the parameter manifold. The measure can be normalized by an integral over the entire parameter manifold to give a prior distribution. The continuum limit of this discretized measure is obtained by taking the limits β* → 1, ε → 0, and N → ∞. In this limit, the measure counts distributions that are completely indistinguishable (β* = 1) even in the presence of an infinite amount of data (N = ∞).⁵

4 I have assumed that the derivatives with respect to Θ commute with expectations taken in the distribution Θ_p to identify the Fisher information with the matrix of second derivatives of the relative entropy.
5 The α and β errors can be treated more symmetrically using the Chernoff bound instead of Stein's lemma, but I will not do that here.
To see the effect of the above procedure, imagine a parameter manifold that can be partitioned into k regions in each of which the Fisher information is constant. Let J_i, U_i, and V_i, respectively, be the Fisher information, parametric volume, and volume of indistinguishability in the ith region. Then the prior assigned to the ith volume by the procedure will be P_i = (U_i/V_i) / Σ_{j=1}^k (U_j/V_j) = U_i √(det J_i) / Σ_{j=1}^k U_j √(det J_j). Since all the β*, ε, and N dependencies cancel, we are now free to take the continuum limit of P_i. This suggests that the prior density induced by the prescription described in the previous paragraph is:

    w(\Theta) = \frac{\sqrt{\det J(\Theta)}}{\int d^d\Theta \, \sqrt{\det J(\Theta)}} .    (2.3)
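As a concrete instance (my illustration, using standard results rather than anything computed in this article), the Bernoulli family has Fisher information J(θ) = 1/[θ(1 − θ)], and equation 2.3 then reproduces the familiar Beta(1/2, 1/2) form of Jeffreys' prior:

```python
# Sketch: Jeffreys' prior (equation 2.3) for the Bernoulli family,
# whose Fisher information is J(theta) = 1 / (theta * (1 - theta)).
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 10001)
sqrt_det_J = 1.0 / np.sqrt(theta * (1 - theta))

# Normalize by the integral in the denominator of equation 2.3.
w = sqrt_det_J / np.trapz(sqrt_det_J, theta)

print(np.trapz(w, theta))                   # ~ 1.0: w is a density
print(np.trapz(sqrt_det_J, theta), np.pi)   # the normalizer is ~ pi
```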
By paying careful attention to technical difficulties involving sets of measure zero and certain sphere-packing problems, it can be rigorously shown that the normalized continuum measure on a parameter manifold that reflects indistinguishability and translation invariance is w(Θ), or Jeffreys' prior (Balasubramanian 1996). In essence, the heuristic argument and the derivation in Balasubramanian (1996) show how to "divide out" the volume of indistinguishable distributions on a parameter manifold and hence give equal weight to equally distinguishable volumes of distributions. In this sense, Jeffreys' prior is seen to be a uniform prior on the distributions indexed by a parametric family.

2.2 Connection with Statistical Mechanics. Putting everything together, we get the following expression for the Bayesian posterior probability of a parametric family in the absence of any prior knowledge about the relative likelihood of the distributions indexed by the family:

    \Pr(A|E) = \frac{\int d^d\Theta \, \sqrt{\det J} \, \exp\left[ -N \left( -\frac{\ln \Pr(E|\Theta)}{N} \right) \right]}{\int d^d\Theta \, \sqrt{\det J}} .    (2.4)
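For a one-parameter example, the following sketch (mine; the Bernoulli family and the data are chosen only for illustration) evaluates the logarithm of equation 2.4's ratio by quadrature, given N coin flips with k heads. The overall factor Pr(E), common to all families, is omitted since it cancels in comparisons between families.

```python
# Sketch: log of the (unnormalized) posterior weight of equation 2.4
# for a Bernoulli family under Jeffreys' prior, given k heads in N flips.
import numpy as np

def log_family_weight(k, N, n_grid=20001):
    theta = np.linspace(1e-6, 1 - 1e-6, n_grid)
    sqrt_J = 1.0 / np.sqrt(theta * (1 - theta))
    log_like = k * np.log(theta) + (N - k) * np.log(1 - theta)
    # exp[-N(-(1/N) ln Pr(E|theta))] is simply the likelihood Pr(E|theta).
    integrand = sqrt_J * np.exp(log_like - log_like.max())
    log_numerator = np.log(np.trapz(integrand, theta)) + log_like.max()
    log_volume = np.log(np.trapz(sqrt_J, theta))   # ln of the denominator
    return log_numerator - log_volume

print(log_family_weight(60, 100))
```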
This equation resembles a partition function with a temperature 1/N and an energy function (−1/N) ln Pr(E|Θ). The dependence on the data E is similar to the dependence of a disordered partition function on the specific set of defects introduced into the system.

The analogy can be made stronger, since the strong law of large numbers says that (−1/N) ln Pr(E|Θ) = (−1/N) Σ_{i=1}^N ln Pr(e_i|Θ) converges in the almost sure sense to:

    E_t\left[ -\ln \Pr(e_i|\Theta) \right] = \int dx \, t(x) \ln\left[ \frac{t(x)}{\Pr(x|\Theta)} \right] - \int dx \, t(x) \ln t(x) = D(t\|\Theta) + h(t) .    (2.5)
Here D(t‖Θ) is the relative entropy between the true distribution and the distribution indexed by Θ, and h(t) is the differential entropy of the true distribution, which is presumed to be finite. With this large-N limit in mind, we rewrite the posterior probability in equation 2.4 as the following partition function:

    \Pr(A|E) = \frac{\int d^d\Theta \, \sqrt{J} \, e^{-N(H_0 + H_d)}}{\int d^d\Theta \, \sqrt{J}} ,    (2.6)

where H_0(Θ) = D(t‖Θ) and H_d(E, Θ) = (−1/N) ln Pr(E|Θ) − D(t‖Θ) − h(t). (Equation 2.6 differs from equation 2.4 by an irrelevant factor of exp[−N h(t)].) H_0 can be regarded as the "energy" of the "state" Θ, while H_d is the additional contribution that arises by interaction with the "defects" represented by the data.

It is instructive to examine the quenched approximation to this disordered partition function. (See Ma 1985 for a discussion of quenching in statistical mechanical systems.) Quenching is carried out by taking the expectation value of the energy of a state in the distribution generating the defects. In the above system E_t[H_d] = 0, giving the quenched posterior probability

    \Pr(A|E)_Q = \frac{\int d^d\Theta \, \sqrt{J} \, e^{-N D(t\|\Theta)}}{\int d^d\Theta \, \sqrt{J}} .    (2.7)
In Section 4 we will see that the logarithm of the posterior probability converges to the logarithm of the quenched probability in a certain sense. This will lead us to regard the quenched probability as a sort of theoretical index of the complexity of a parametric family relative to a given true distribution.

3 Asymptotic Analysis or Low-Temperature Expansion

Equation 2.4 represents the full content of Bayesian model selection. However, in order to extract some insight, it is necessary to examine special cases. Let −ln Pr(E|Θ) be a smooth function of Θ that attains a global minimum at Θ̂, and assume that J_ij(Θ) is a smooth function of Θ that is positive definite at Θ̂. Finally, suppose that Θ̂ lies in the interior of the compact parameter space and that the values of local minima are bounded away from the global minimum by some b.⁶ For any given b, for sufficiently large N, the Bayesian posterior probability will then be dominated by the neighborhood of Θ̂, and we can carry out a low-temperature expansion around the saddlepoint at Θ̂. We take the metric on the parameter manifold to be the Fisher information, since the Jeffreys' prior has the form of a measure derived from such a metric.

6 In Section 3.2 we will discuss how to relax these conditions.
This choice of metric also follows the work described in Amari (1985) and Amari et al. (1987). We will use ∇_µ to indicate the covariant derivative with respect to Θ^µ, with a flat connection for the Fisher information metric.⁷ Readers who are unfamiliar with covariant derivatives may read ∇_µ as the partial derivative with respect to Θ^µ, since I will not be emphasizing the geometric content of the covariant derivative.

Let Ĩ_{µ1,...,µi} = (−1/N) ∇_{µ1} ⋯ ∇_{µi} ln Pr(E|Θ)|_{Θ̂} and F_{µ1,...,µi} = ∇_{µ1} ⋯ ∇_{µi} Tr ln J_ij|_{Θ̂}, where Tr represents the trace of a matrix. Writing (det J)^{1/2} as exp[(1/2) Tr ln J], we Taylor expand the exponent in the integrand of the Bayesian posterior around Θ̂ and rescale the integration variable to Φ = N^{1/2}(Θ − Θ̂) to arrive at:

    \Pr(A|E) = \frac{ e^{-\left[ -\ln \Pr(E|\hat\Theta) - \frac{1}{2} \mathrm{Tr} \ln J(\hat\Theta) \right]} \, N^{-d/2} \int d^d\Phi \, e^{-\left( \frac{1}{2} \tilde I_{\mu\nu} \phi^\mu \phi^\nu + G(\Phi) \right)} }{ \int d^d\Theta \, \sqrt{\det J_{ij}} } .    (3.1)

Here G(Φ) collects the terms that are suppressed by powers of N:

    G(\Phi) = \sum_{i=1}^{\infty} \frac{1}{N^{i/2}} \left[ \frac{1}{(i+2)!} \tilde I_{\mu_1 \ldots \mu_{i+2}} \phi^{\mu_1} \cdots \phi^{\mu_{i+2}} - \frac{1}{2 \, i!} F_{\mu_1 \ldots \mu_i} \phi^{\mu_1} \cdots \phi^{\mu_i} \right]
            = \frac{1}{\sqrt{N}} \left[ \frac{1}{3!} \tilde I_{\mu_1\mu_2\mu_3} \phi^{\mu_1}\phi^{\mu_2}\phi^{\mu_3} - \frac{1}{2} F_{\mu_1} \phi^{\mu_1} \right] + \frac{1}{N} \left[ \frac{1}{4!} \tilde I_{\mu_1 \ldots \mu_4} \phi^{\mu_1} \cdots \phi^{\mu_4} - \frac{1}{2 \cdot 2!} F_{\mu_1\mu_2} \phi^{\mu_1}\phi^{\mu_2} \right] + O\left( \frac{1}{N^{3/2}} \right) .    (3.2)

As before, repeated indices are summed over. The integral in equation 3.1 may now be evaluated in a series expansion using a standard trick from statistical mechanics (Itzykson and Drouffe 1989). Define a "source" h = {h_1, ..., h_d} as an auxiliary variable. Then it is easy to verify that:

    \int d^d\Phi \, e^{-\left( \frac{1}{2} \tilde I_{\mu\nu} \phi^\mu \phi^\nu + G(\Phi) \right)} = e^{-G(\nabla_h)} \int d^d\Phi \, e^{-\left( \frac{1}{2} \tilde I_{\mu\nu} \phi^\mu \phi^\nu + h_\mu \phi^\mu \right)} ,    (3.3)
where the argument of G, Φ = (φ^1, ..., φ^d), has been replaced by ∇_h = {∂_{h_1}, ..., ∂_{h_d}}, and we assume that the derivatives commute with the integral. The remaining obstruction is the compactness of the parameter space. We make the final assumption that the bounds of the integration can be extended to infinity with negligible error, since Θ̂ is sufficiently in the interior or because N is sufficiently large. Performing the gaussian integral in equation 3.3 and applying the differential operator exp[−G(∇_h)], we find an asymptotic series in powers of 1/N.

7 See Amari (1985) and Amari et al. (1987) for discussions of differential geometry in a statistical setting.
It turns out to be most useful to examine χ_E(A) ≡ −ln Pr(A|E). Defining V = ∫ d^dΘ √(det J(Θ)), we find to O(1/N):

    \chi_E(A) = N \left( \frac{-\ln \Pr(E|\hat\Theta)}{N} \right) + \frac{d}{2} \ln N - \frac{1}{2} \ln\left[ \frac{\det J_{ij}(\hat\Theta)}{\det \tilde I_{\mu\nu}(\hat\Theta)} \right] - \ln\left[ \frac{(2\pi)^{d/2}}{V} \right]
        + \frac{1}{N} \Bigg\{ \frac{\tilde I_{\mu_1\mu_2\mu_3\mu_4}}{4!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} (\tilde I^{-1})^{\mu_3\mu_4} + \cdots \Big]
        - \frac{F_{\mu_1\mu_2}}{2 \cdot 2!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} + (\tilde I^{-1})^{\mu_2\mu_1} \Big]
        - \frac{\tilde I_{\mu_1\mu_2\mu_3} \tilde I_{\nu_1\nu_2\nu_3}}{2! \, 3! \, 3!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} (\tilde I^{-1})^{\mu_3\nu_1} (\tilde I^{-1})^{\nu_2\nu_3} + \cdots \Big]
        - \frac{F_{\mu_1} F_{\mu_2}}{2! \cdot 4 \cdot 2! \cdot 2!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} + \cdots \Big]
        + \frac{F_{\mu_1} \tilde I_{\mu_2\mu_3\mu_4}}{2! \cdot 2 \cdot 2! \cdot 3!} \Big[ (\tilde I^{-1})^{\mu_1\mu_2} (\tilde I^{-1})^{\mu_3\mu_4} + \cdots \Big] \Bigg\} .    (3.4)
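The first three terms of this expansion can be checked numerically in one dimension. The sketch below (my construction, for illustration only) evaluates them for a Bernoulli family, where for one datum J(θ) = 1/[θ(1 − θ)], the observed information per datum at the maximum likelihood point happens to equal J there, and V = ∫₀¹ √J dθ = π; the result is compared with χ_E computed exactly by quadrature.

```python
# Sketch: the O(N), O(ln N), and O(1) terms of equation 3.4 for a
# Bernoulli family, compared against chi_E computed by quadrature.
import numpy as np

k, N = 60, 100
theta_hat = k / N                                  # maximum likelihood point
log_like_hat = k * np.log(theta_hat) + (N - k) * np.log(1 - theta_hat)

det_J = 1.0 / (theta_hat * (1 - theta_hat))  # Fisher information, one datum
det_I = det_J   # for Bernoulli the observed information per datum equals J
V = np.pi       # integral of sqrt(J) over [0, 1]

chi_approx = (-log_like_hat                        # O(N) accuracy term
              + 0.5 * np.log(N)                    # (d/2) ln N, with d = 1
              - 0.5 * np.log(det_J / det_I)        # O(1) robustness term (= 0 here)
              - np.log(np.sqrt(2 * np.pi) / V))    # ln[(2 pi)^{d/2} / V]

theta = np.linspace(1e-6, 1 - 1e-6, 200001)
integrand = np.exp(k * np.log(theta) + (N - k) * np.log(1 - theta)) \
            / np.sqrt(theta * (1 - theta))
chi_exact = -np.log(np.trapz(integrand, theta) / V)
print(chi_approx, chi_exact)   # should agree up to O(1/N) corrections
```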
Terms of higher orders in 1/N are easily evaluated with a little labor, and systematic diagrammatic expansions can be developed (Itzykson and Drouffe 1989). In the next section I discuss the meaning of equation 3.4.

3.1 Meaning of the Asymptotic Expansion. I will show that the Bayesian posterior measures simplicity and accuracy of a parametric family by examining equation 3.4 and noting that models with larger Pr(A|E), and hence smaller χ_E(A), are better. Since the expansion in equation 3.4 is only asymptotic and not convergent, the number of terms that can be reliably kept will depend on the specific model family A under consideration, so we will simply analyze the general behavior of the first few terms. Furthermore, χ_E(A)/N is a random variable whose standard deviation can be expected to be O(1/√N). Although these fluctuations threaten to overwhelm subleading terms in the expansion, it is sensible to consider the meaning of these terms because they can be important when N is not asymptotically large.

The O(N) term, N(−ln Pr(E|Θ̂)/N), which dominates asymptotically, is the negative log likelihood of the data evaluated at the maximum likelihood point.⁸ It measures the accuracy with which the parametric family can describe the available data.

8 This term is O(N), not O(1), because (1/N) ln Pr(E|Θ̂) approaches a finite limit at large N.
We will see in Section 4 that for sufficiently large N, model families with the smallest relative entropy distance to the true distribution are favored by this term. The term of O(N) arises from the saddlepoint value of the integrand in equation 2.4 and represents the Landau approximation to the partition function.

The term of O(ln N) penalizes models with many degrees of freedom and is a measure of simplicity. This term arises physically from the statistical fluctuations around the saddlepoint configuration. These fluctuations cause the partition function in equation 2.4 to scale as N^{−d/2}, leading to the logarithmic term in χ_E. Note that the terms of O(N) and O(ln N) have appeared together in the literature as the stochastic complexity of a parametric family relative to a collection of data (Rissanen 1984, 1986). This definition is justified by arguing that a family with the lowest stochastic complexity provides the shortest codes for the data in the limit that N → ∞. Our results suggest that stochastic complexity is merely a truncation of the logarithm of the posterior probability of a model family and that adding the subleading terms in χ_E to the definition of stochastic complexity would yield shorter codes for finite N.

The O(1) term, which arises from the determinant of quadratic fluctuations around the saddlepoint, is even more interesting. The determinant of Ĩ^{−1} is proportional to the volume of the ellipsoid in parameter space around Θ̂ where the value of the integrand of the Bayesian posterior is significant.⁹ The scale for determining whether det Ĩ^{−1} is large or small is set by the Fisher information on the surface whose determinant defines the volume element. Consequently, the term ln(det J/det Ĩ)^{1/2} can be understood as measuring the robustness of the model in the sense that it measures the relative volume of the parameter space that provides good models of the data. More robust models in this sense will be less sensitive to the precise choice of parameters. We also observe from the discussion regarding Jeffreys' prior that the volume of indistinguishability around Θ̂ is proportional to (det J)^{−1/2}. So the quantity (det J/det Ĩ)^{1/2} is essentially proportional to the ratio V_large/V_indist, the ratio of the volume where the integrand of the Bayesian posterior is large to the volume of indistinguishability introduced earlier. Essentially, a model family is better (more natural or robust) if it contains many distinguishable distributions that are close to the true distribution. Related observations have been made before (MacKay 1992a, 1992b; Clarke and Barron 1990) but without the interpretation in terms of the robustness of a model family.

The term ln[V/(2π)^{d/2}] can be understood as a preference for models that have a smaller invariant volume in the space of distributions and hence are more constrained. It may be argued that this term is a bit artificial, since it can be ill-defined for models with noncompact parameter spaces. In such situations, a regulator would have to be placed on the parameter space to yield a finite volume.

9 If we fix a fraction f < 1 where f is close to 1, the integrand of the Bayesian posterior will be greater than f times the peak value in an elliptical region around the maximum.
The terms proportional to 1/N are less easy to interpret. They involve higher derivatives of the metric on the parameter manifold and of the relative entropy distances between points on the manifold and the true distribution. This suggests that these terms essentially penalize high curvatures of the model manifold, but it is hard to extract such an interpretation in terms of components of the curvature tensor on the manifold. It is worth noting that while terms of O(1) and larger in χ_E(A) depend at most on the measure (prior distribution) assigned to the parameter manifold, the terms of O(1/N) depend on the geometry via the connection coefficients in the covariant derivatives. For this reason, the O(1/N) terms are the leading probes of the effect that the geometry of the space of distributions has on statistical inference in a Bayesian setting, and so it would be very interesting to analyze them.

In summary, Bayesian model family inference embodies Occam's razor because, for small N, the subleading terms that measure simplicity and robustness will be important, while for large N, the accuracy of the model family dominates.

3.2 Analysis of More General Situations. The asymptotic expansion in equation 3.4 and the subsequent analysis were carried out for the special case of a posterior probability with a single global maximum in the integrand that lay in the interior of the parameter space. Nevertheless, the basic insights are applicable far more generally. First, if the global maximum lies on the boundary of the parameter space, we can account for the portion of the peak that is cut off by the boundary and reach essentially the same conclusions. Second, if there are multiple discrete global maxima, each contributes separately to the asymptotic expansion, and the contributions can be added to reach the same conclusions.

The most important difficulty arises when the global maximum is degenerate, so that the matrix Ĩ in equation 3.4 has zero eigenvalues. The eigenvectors corresponding to these zeroes are tangent to directions in parameter space in which the value of the maximum is unchanged up to second order in perturbations around the maximum. These sorts of degeneracies are particularly likely to arise when the true distribution is not a member of the family under consideration and can be dealt with by the method of collective coordinates. Essentially, we would choose new parameters for the model, a subset of which parameterize the degenerate subspace. The integral over the degenerate subspace then factors out of the integral in equation 3.3 and essentially contributes a factor of the volume of the degenerate subspace times terms arising from the action of the differential operator exp[−G(∇_h)]. The evaluation of specific examples of this method in the context of statistical inference will be left to future publications.

There are situations in which the perturbative expansion in powers of 1/N is invalid. For example, the partition function in equation 2.6, regarded as a function of N, may have singularities.
These singularities and the associated breakdown of the perturbative analysis of this section would be of the utmost interest, since they would be signatures of "phase transitions" in the process of statistical inference. This point will be discussed further in Section 5.

4 The Razor of a Model Family

The large-N limit of the partition function in equation 2.4 suggests the definition of an ideal theoretical index of the complexity of a parametric family relative to a given true distribution. We know from equation 2.5 that (−1/N) ln[Pr(E|Θ)] → D(t‖Θ) + h(t) as N grows large.

Now assume that the maximum likelihood estimator is consistent in the sense that Θ̂ = arg max_Θ ln Pr(E|Θ) converges in probability to Θ* = arg min_Θ D(t‖Θ) as N grows large.¹⁰ Also suppose that the log likelihood of a single outcome, ln Pr(e_i|Θ), considered as a family of functions of Θ indexed by e_i, is equicontinuous at Θ*.¹¹ Finally, suppose that all derivatives of ln Pr(e_i|Θ) with respect to Θ are also equicontinuous at Θ*.

Subject to the assumptions in the previous paragraph, it is easily shown that (−1/N) ln Pr(E|Θ̂) → D(t‖Θ*) + h(t) as N grows large. Next, using the covariant derivative with respect to Θ defined in Section 3, let J̃_{µ1,...,µi} = ∇_{µ1} ⋯ ∇_{µi} D(t‖Θ)|_{Θ*}. It also follows that Ĩ_{µ1,...,µi} → J̃_{µ1,...,µi} (Balasubramanian 1996). Since the terms in the asymptotic expansion of (1/N)(χ_E − N h(t)) (equation 3.4) are continuous functions of ln Pr(E|Θ) and its derivatives, they individually converge to limits obtained by replacing each Ĩ by J̃ and (−1/N) ln Pr(E|Θ̂) by D(t‖Θ*) + h(t). Define (−1/N) ln R_N(A) to be the sum of the series of limits of the individual terms in the asymptotic expansion of (1/N)(χ_E − N h(t)):

    \frac{-\ln R_N(A)}{N} = D(t\|\Theta^*) + \frac{d}{2N} \ln N - \frac{1}{2N} \ln\left[ \frac{\det J_{ij}(\Theta^*)}{\det \tilde J_{\mu\nu}(\Theta^*)} \right] - \frac{1}{N} \ln\left[ \frac{(2\pi)^{d/2}}{V} \right] + O\left( \frac{1}{N^2} \right) .    (4.1)

10 In other words, assume that given any neighborhood of Θ*, Θ̂ falls in that neighborhood with high probability for sufficiently large N. If the maximum likelihood estimator is inconsistent, statistics has very little to say about the inference of probability densities.
11 In other words, given any ε > 0, there is a neighborhood M of Θ* such that for every e_i and Θ ∈ M, |ln Pr(e_i|Θ) − ln Pr(e_i|Θ*)| < ε.
This formal series of limits can be resummed to obtain

    R_N(A) = \frac{\int d^d\Theta \, \sqrt{J} \, e^{-N D(t\|\Theta)}}{\int d^d\Theta \, \sqrt{J}} .    (4.2)
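As a sanity check on equations 4.1 and 4.2 (my example, not the author's), the razor can be computed in closed form for a gaussian location family {N(θ, 1) : θ ∈ [−a, a]} with true distribution t = N(0, 1), where J(θ) = 1 and D(t‖θ) = θ²/2:

```python
# Sketch: the razor (equation 4.2) for a gaussian location family with
# flat Fisher metric J = 1 on [-a, a] and true distribution N(0, 1),
# so that D(t || theta) = theta^2 / 2.  Compared with equation 4.1.
import numpy as np

def log_razor(N, a=5.0, n_grid=100001):
    theta = np.linspace(-a, a, n_grid)
    num = np.trapz(np.exp(-N * theta ** 2 / 2), theta)   # sqrt(J) = 1
    den = 2 * a                                          # integral of sqrt(J)
    return np.log(num / den)

a = 5.0
for N in [10, 100, 1000]:
    exact = log_razor(N, a)
    # Equation 4.1 with D(t||theta*) = 0, d = 1, det J = det J~, V = 2a:
    approx = -0.5 * np.log(N) + np.log(np.sqrt(2 * np.pi) / (2 * a))
    print(N, exact, approx)   # the (d/2) ln N simplicity penalty is visible
```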
We encountered R_N(A) in Section 2.2 as the quenched approximation to the partition function in equation 2.4. R_N(A) will be called the razor of the model family A.

The razor, R_N(A), is a theoretical index of the complexity of the model family A relative to the true distribution t given N data points. In a certain sense, the razor is the ideal quantity that Bayesian methods seek to estimate from the data available in a given realization of the model inference problem. Indeed, the quenched approximation to the Bayesian partition function consists precisely of averaging over the data in different realizations. The terms in the expansion of the log razor in equation 4.1 are the ideal analogs of the terms in χ_E since they arise from derivatives of the relative entropy distance between distributions indexed by the model family and the true distribution. The leading term tells us that for sufficiently large N, Bayesian inference picks the model family that comes closest to the true distribution in relative entropy. The subleading terms have the same interpretations as the terms in χ_E discussed in the previous section, except that they are the ideal quantities to which the corresponding terms in χ_E tend when enough data are available.

The razor is useful when we know the true distribution as well as the model families being used by a particular system and we wish to analyze the expected behavior of Bayesian inference. It is also potentially useful as a tool for modeling and analysis of the general types of phenomena that can occur in Bayesian inference: different relative entropy distances D(t‖Θ) can yield radically different learning behaviors, as discussed in the next section. The razor is considerably easier to analyze than the full Bayesian posterior probability since the quenched approximation to equation 2.4 given in equation 4.2 defines a statistical mechanics on the space of distributions in which the "disorder" has been averaged out. The tools of statistical mechanics can then be straightforwardly applied to a system with temperature 1/N and energy function D(t‖Θ).

4.1 Comparison with Other Model Selection Procedures. The first two terms in χ_E(A) have appeared together as the stochastic complexity of a model family relative to the true distribution t (Rissanen 1984, 1986). Equivalently, the first two terms of the razor are the expectation value in t of the stochastic complexity of a model family given N data points. The minimum description length (MDL) approach to model selection consists of picking the model that minimizes the stochastic complexity and hence agrees in leading order with the Bayesian approach of this article.
In the context of neural networks, other model selection criteria, such as the network information criterion (NIC) of Murata et al. (1994) and the related effective dimension of Moody (1992), have been used.¹² Specializing these criteria to the case of density estimation with a logarithmic loss function, we find the general form NIC = −(1/N) ln Pr(E|Θ̂) + k/N, where Θ̂ is the maximum likelihood parameter and k is an "effective dimension" computed as described in Moody (1992) and Murata et al. (1994).¹³ The second term, which is the penalty for complexity, is essentially the expected value of the difference between the log likelihoods of the test set and the training set. In other words, the term arises because on general grounds high-dimensional models are expected to generalize poorly from a small training set.

Although there are superficial similarities between χ_E(A) and the NIC, they arise from fundamentally different analyses. When the true distribution t lies within the parametric family, k equals d, the dimension of the parameter space (Murata et al. 1994). So in this special case, much as χ_E(A) converges to the razor, the two terms in the NIC converge in probability to NIC − h(t) → S = D(t‖Θ*) + d/N. In the same special case, J(Θ*) = J̃(Θ*), so that to O(1/N) the razor is given by (−1/N) ln R_N(A) = D(t‖Θ*) + d(ln N)/2N, which differs significantly from S. Since χ_E(A) agrees in leading order with the MDL criterion, models selected using the Bayesian procedure are expected to yield minimal codes for data generated from the true distribution. It is not clear that the NIC has this property. Further analysis is required to understand the advantages of the different criteria.

5 Biophysical Relevance and Some Open Questions

The general framework described here is relevant to biophysics if we believe that neural systems optimize their accumulation of information from a statistically varying environment. This is likely to be true in at least some circumstances, since an organism derives clear advantages from rapid and efficient detection and encoding of information. For example, see the discussions of Bialek (1991) and Atick (1991) of neural signal processing systems that approach physical and information theoretic limits. A creature such as a fly is faced with the problem of estimating the statistical profile of its environment from the small amount of data available at its photodetectors. The general formalism presented in this article applies to such problems, and an optimally designed fly would implement the formalism subject to the constraints of its biological hardware. In this section we examine several interesting questions in the theory of learning that can be discussed effectively in the statistical mechanical language introduced here.

12 I thank an anonymous referee for suggesting the comparison.
13 Murata et al. (1994) consider a situation in which the learning algorithm finds an estimator of Θ̂ rather than Θ̂ itself, but we will not consider this more general scenario here.
First, consider the possibility of phase transitions in the disordered partition function that describes the Bayesian posterior probability or in the quenched approximation defining the razor. Phase transitions arise from a competition between entropy and energy, which, in the present context, is a competition between simplicity and accuracy. We should expect the existence of systems in which inference at small N is dominated by "simpler" and more "robust" saddlepoints, whereas at large N more "accurate" saddlepoints are favored. As discussed in Section 3.1, the distributions in the neighborhood of simpler and more robust saddlepoints are more concentrated near the true distribution.¹⁴ The transitions between regimes dominated by these different saddlepoints would manifest themselves as singularities in the perturbative methods that led to the asymptotic expansions for χ_E and ln R_N(A).

The phase transitions are interesting even when the task at hand is not the comparison of model families but merely the selection of parameters for a given family. In Section 3.1 we interpreted the terms of O(1) in χ_E as measurements of the robustness or naturalness of a model. These robustness terms can be evaluated at different saddlepoints of a given model, and a more robust point may be preferable at small N since the parameter estimation would then be less sensitive to fluctuations in the data.

So far we have concentrated on the behavior of the Bayesian posterior and the razor as a function of the number of data points. Instead we could ask how they behave when the true distribution is changed. For example, this can happen in a biophysical context if the environment sensed by a fly changes when it suddenly finds itself indoors. In statistical mechanical terms, we wish to know what happens when the energy of a system is time dependent. If the change is abrupt, the system will dynamically move between equilibria defined by the energy functions before and after the change. If the change is very slow, we would expect adaptation that proceeds gradually. In the language of statistical inference, these adaptive processes correspond to the learning of changes in the true distribution.

A final question that has been touched on but not analyzed in this article is the influence of the geometry of parameter manifolds on statistical inference. As discussed in Section 3.1, terms of O(1/N) and smaller in the asymptotic expansions of the log Bayesian posterior and the log razor depend on details of the geometry of the parameter manifold. It would be very interesting to understand the precise meaning of this dependence.

14 In Section 3.1 I discussed how Bayesian inference embodies Occam's razor by penalizing complex families until the data justify their choice. Here we are discussing Occam's razor for choice of saddlepoints within a given family.
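A toy model (entirely my construction, not analyzed in this article) makes the competition concrete: give D(t‖θ) a broad, shallow minimum (robust but less accurate) and a narrow, deep one (accurate but fragile), and watch the quenched posterior mass of equation 2.7 switch basins as N grows.

```python
# Toy sketch: a "phase transition" in the quenched partition function
# (equation 2.7).  The energy D(t||theta) has two minima; the dominant
# saddlepoint switches as the temperature 1/N is lowered.
import numpy as np

theta = np.linspace(-4, 4, 400001)
broad = 0.10 + 0.5 * (theta + 2.0) ** 2      # robust but less accurate
narrow = 0.02 + 500.0 * (theta - 2.0) ** 2   # accurate but fragile
D = np.minimum(broad, narrow)                # relative entropy landscape

for N in [10, 30, 50, 100, 300]:
    weights = np.exp(-N * D)
    weights /= np.trapz(weights, theta)
    # Posterior mass in the broad basin (theta < 0):
    mask = theta < 0
    mass_broad = np.trapz(weights[mask], theta[mask])
    print(N, round(mass_broad, 3))   # drops from ~1 to ~0 near N ~ 45
```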
6 Conclusion

In this paper I have cast parametric model selection as a disordered statistical mechanics on the space of probability distributions. A low-temperature expansion was used to develop the asymptotics of Bayesian methods beyond the analyses available in the literature, and it was shown that Bayesian methods for model family inference embody Occam's razor. While reaching these results, I derived and discussed a novel interpretation of Jeffreys' prior density as the uniform prior on the probability distributions indexed by a parametric family. By considering the large-N limit and the quenched approximation of the disordered system implemented by Bayesian inference, we derived the razor, a theoretical index of the complexity of a parametric family relative to a true distribution. Finally, in view of the analog statistical mechanical interpretation, I discussed various interesting phenomena that should be present in systems that perform Bayesian learning. It is easy to create models that display these phenomena simply by considering families of distributions for which D(t‖Θ) has the right structure. It would be interesting to examine models of known biophysical relevance to see if they exhibit such effects, so that experiments could be carried out to verify their presence or absence in the real world. This project will be left to a future publication.

Appendix: Measuring Indistinguishability of Distributions

Let us take Θ_p and Θ_q to be points on a parameter manifold. Since we are working in the context of density estimation, a suitable measure of the distinguishability of Θ_p and Θ_q should be derived by taking N data points drawn from either p or q and asking how well we can guess which distribution produced the data. If p and q do not give very distinguishable distributions, they should not be counted separately, since that would count the same distribution twice.

Precisely this question of distinguishability is addressed in the classical theory of hypothesis testing. Suppose {e_1, ..., e_N} ∈ E^N are drawn iid from one of f_1 and f_2 with D(f_1‖f_2) < ∞. Let A_N ⊆ E^N be the acceptance region for the hypothesis that the distribution is f_1, and define the error probabilities α_N = f_1^N(A_N^C) and β_N = f_2^N(A_N). (A_N^C is the complement of A_N in E^N, and f^N denotes the product distribution on E^N describing N iid outcomes drawn from f.) In these definitions, α_N is the probability that f_1 was mistaken for f_2, and β_N is the probability of the opposite error. Stein's lemma tells us how low we can make β_N given a particular value of α_N. Indeed, let us define β_N^ε = min_{A_N ⊆ E^N, α_N ≤ ε} β_N. Then Stein's lemma tells us that

    \lim_{\epsilon \to 0} \lim_{N \to \infty} \frac{1}{N} \ln \beta_N^{\epsilon} = -D(f_1 \| f_2) .    (A.1)
By examining the proof of Stein's lemma (Cover and Thomas 1991), we find that for fixed ε and sufficiently large N, the optimal choice of decision region places the following bound on β_N^ε:

    -D(f_1\|f_2) - \delta_N + \frac{\ln(1-\alpha_N)}{N} \le \frac{\ln \beta_N^{\epsilon}}{N} \le -D(f_1\|f_2) + \delta_N + \frac{\ln(1-\alpha_N)}{N} ,    (A.2)
where α_N < ε for sufficiently large N. The δ_N are any sequence of positive constants that satisfy the property that

    \alpha_N = f_1^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} \ln \frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta_N \right) \le \epsilon    (A.3)
for all sufficiently large N. Now (1/N) Σ_{i=1}^N ln[f_1(e_i)/f_2(e_i)] converges to D(f_1‖f_2) by the law of large numbers, since D(f_1‖f_2) = E_{f_1}[ln(f_1(e_i)/f_2(e_i))]. So for any fixed δ we have:

    f_1^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} \ln \frac{f_1(e_i)}{f_2(e_i)} - D(f_1\|f_2) \right| > \delta \right) < \epsilon    (A.4)
for all sufficiently large N. For a fixed ε and a fixed N, let Δ_{ε,N} be the collection of δ > 0 that satisfy equation A.4. Let δ_{εN} be the infimum of the set Δ_{ε,N}. Equation A.4 guarantees that for any δ > 0, for any sufficiently large N, 0 < δ_{εN} < δ. We conclude that δ_{εN} chosen in this way is a sequence that converges to zero as N → ∞ while satisfying the condition in equation A.3 that is necessary for proving Stein's lemma.

We will now apply these facts to the problem of distinguishability of points on a parameter manifold. Let Θ_p and Θ_q index two distributions on a parameter manifold, and suppose that we are given N outcomes generated independently from one of them. We are interested in using Stein's lemma to determine how distinguishable Θ_p and Θ_q are. By Stein's lemma,

    -D(\Theta_p\|\Theta_q) - \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1-\alpha_N)}{N} \le \frac{\ln \beta_N^{\epsilon}(\Theta_q)}{N} \le -D(\Theta_p\|\Theta_q) + \delta_{\epsilon N}(\Theta_q) + \frac{\ln(1-\alpha_N)}{N} ,    (A.5)
where we have written δ_{εN}(Θ_q) and β_N^ε(Θ_q) to emphasize that these quantities are functions of Θ_q for a fixed Θ_p.
Let A = −D(Θ_p‖Θ_q) + (1/N) ln(1 − α_N) be the average of the upper and lower bounds in equation A.5. Then A ≥ −D(Θ_p‖Θ_q) + (1/N) ln(1 − ε) because the δ_{εN}(Θ_q) have been chosen to satisfy equation A.5. We now define the set of distributions U_N = {Θ_q : −D(Θ_p‖Θ_q) + (1/N) ln(1 − ε) ≥ (1/N) ln β*}, where 1 > β* > 0 is some fixed constant. Note that as N → ∞, D(Θ_p‖Θ_q) → 0 for Θ_q ∈ U_N. We want to show that U_N is a set of distributions that cannot be well distinguished from Θ_p.

The first way to see this is to observe that the average of the upper and lower bounds on ln β_N^ε is greater than or equal to ln β* for Θ_q ∈ U_N. So in this loose, average sense, the error probability β_N^ε exceeds β* for Θ_q ∈ U_N. More carefully, note that (1/N) ln(1 − α_N) ≥ (1/N) ln(1 − ε) by choice of the δ_{εN}(Θ_q). So using equation A.5, we see that (1/N) ln β_N^ε(Θ_q) ≥ (1/N) ln β* − δ_{εN}(Θ_q). Exponentiating this inequality, we find that

    1 \ge \left[ \beta_N^{\epsilon}(\Theta_q) \right]^{1/N} \ge (\beta^*)^{1/N} e^{-\delta_{\epsilon N}(\Theta_q)} .    (A.6)
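A small Monte Carlo sketch (mine; the gaussian family, δ, and sample sizes are illustrative choices) shows the behavior described here for a fixed alternative: using the relative-entropy typical-set test of equation A.3 for θ_p = N(0, 1) against θ_q = N(1, 1), the quantity (β_N)^{1/N} stays bounded away from zero and below the e^{−(D−δ)} bound.

```python
# Sketch: Monte Carlo type-II error for the typical-set test (eq. A.3)
# between theta_p = N(0,1) and theta_q = N(1,1), for which
# D(theta_p || theta_q) = mu^2 / 2 with mu = 1.  Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
mu, delta, trials = 1.0, 0.25, 200000
D = mu ** 2 / 2

for N in [2, 5, 10, 20]:
    x = rng.normal(mu, 1.0, size=(trials, N))     # data drawn from theta_q
    # Per-outcome log-likelihood ratio ln(p/q) = mu^2/2 - mu*x.
    llr = np.mean(mu ** 2 / 2 - mu * x, axis=1)
    beta = np.mean(np.abs(llr - D) < delta)       # theta_p wrongly accepted
    print(N, beta ** (1.0 / N), np.exp(-(D - delta)))   # stays below the bound
```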
The significance of equation A.6 is best understood by considering parametric families in which, for every Θ_q, X_q(e_i) = ln[Θ_p(e_i)/Θ_q(e_i)] is a random variable with finite mean and bounded variance in the distribution indexed by Θ_p. In that case, taking b to be the bound on the variances, Chebyshev's inequality says that

    \Theta_p^N\left( \left| \frac{1}{N} \sum_{i=1}^{N} X_q(e_i) - D(\Theta_p\|\Theta_q) \right| > \delta \right) \le \frac{\mathrm{Var}(X)}{\delta^2 N} \le \frac{b}{\delta^2 N} .    (A.7)
In order to satisfy α_N ≤ ε, it suffices to choose δ = (b/Nε)^{1/2}. So if the bounded variance condition is satisfied, δ_{εN}(Θ_q) ≤ (b/Nε)^{1/2} for any Θ_q, and therefore we have the limit lim_{N→∞} sup_{Θ_q∈U_N} δ_{εN}(Θ_q) = 0. Applying this limit to equation A.6, we find that

    1 \ge \lim_{N\to\infty} \inf_{\Theta_q \in U_N} \left[ \beta_N^{\epsilon}(\Theta_q) \right]^{1/N} \ge 1 \times \lim_{N\to\infty} \inf_{\Theta_q \in U_N} e^{-\delta_{\epsilon N}(\Theta_q)} = 1 .    (A.8)
In summary, we find that lim_{N→∞} inf_{Θ_q∈U_N} [β_N^ε(Θ_q)]^{1/N} = 1. This is to be contrasted with the behavior of β_N^ε(Θ_q) for any fixed Θ_q ≠ Θ_p, for which lim_{N→∞} [β_N^ε(Θ_q)]^{1/N} = exp[−D(Θ_p‖Θ_q)] < 1. We have essentially shown that the sets U_N contain distributions that are not very distinguishable from Θ_p. The smallest one-sided error probability β_N^ε for distinguishing between Θ_p and Θ_q ∈ U_N remains essentially constant, leading to the asymptotics in equation A.8.

Define κ ≡ −ln β* + ln(1 − ε), so that we can summarize the region U_N of high probability of error β* at fixed ε as κ/N ≥ D(Θ_p‖Θ_q). In this region, the distributions are indistinguishable from Θ_p, with error probabilities α_N ≤ ε and (β_N^ε)^{1/N} ≥ (β*)^{1/N} exp(−δ_{εN}).
Acknowledgments

I thank Kenji Yamanishi for several fruitful conversations. I have also had useful discussions and correspondence with Steve Omohundro, Erik Ordentlich, Don Kimber, Phil Chou, and Erhan Cinlar. Finally, I am grateful to Curt Callan for supporting and encouraging this project and to Phil Anderson for helping with travel funds to the 1995 Workshop on Maximum Entropy and Bayesian Methods. This work was supported in part by DOE grant DE-FG02-91ER40671.

References

Amari, S. I. 1985. Differential Geometrical Methods in Statistics. Springer-Verlag, Berlin.
Amari, S. I., Barndorff-Nielsen, O. E., Kass, R. E., Lauritzen, S. L., and Rao, C. R. 1987. Differential Geometry in Statistical Inference, vol. 10, Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward, CA.
Atick, J. J. 1991. Could information theory provide an ecological theory of sensory processing? In Princeton Lectures in Biophysics, W. Bialek, ed., pp. 223–291. World Scientific, Singapore.
Balasubramanian, V. 1996. A geometric formulation of Occam's razor for inference of parametric distributions. Available as preprint number adap-org/9601001 from http://xyz.lanl.gov/ and as Princeton University Physics Preprint PUPT-1588.
Barron, A. R. 1985. Logically smooth density estimation. Ph.D. thesis, Stanford University.
Barron, A. R., and Cover, T. M. 1991. Minimum complexity density estimation. IEEE Transactions on Information Theory 37(4), 1034–1054.
Bialek, W. 1991. Optimal signal processing in the nervous system. In Princeton Lectures in Biophysics, W. Bialek, ed., pp. 321–401. World Scientific, Singapore.
Clarke, B. S., and Barron, A. R. 1990. Information-theoretic asymptotics of Bayes methods. IEEE Transactions on Information Theory 36(3), 453–471.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. Wiley, New York.
Itzykson, C., and Drouffe, J. M. 1989. Statistical Field Theory. Cambridge University Press, Cambridge.
Jeffreys, H. 1961. Theory of Probability. 3d ed. Oxford University Press, New York.
Lee, P. M. 1989. Bayesian Statistics: An Introduction. Oxford University Press, New York.
Ma, S. K. 1985. Statistical Mechanics. World Scientific, Singapore.
MacKay, D. J. C. 1992a. Bayesian interpolation. Neural Computation 4(3), 415–447.
MacKay, D. J. C. 1992b. A practical Bayesian framework for backpropagation networks. Neural Computation 4(3), 448–472.
Moody, J. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. Proceedings of NIPS 4, 847–854.
Murata, N., Yoshizawa, S., and Amari, S. 1994. Network information criterion— determining the number of hidden units for artificial neural network models. IEEE Trans. Neural Networks 5, 865–872. Potters, M., and Bialek, W. 1994. Statistical mechanics and visual signal processing. Journal de Physique I 4(11), 1755–1775. Rissanen, J. 1984. Universal coding, information, prediction and estimation. IEEE Transactions on Information Theory 30(4), 629–636. Rissanen, J. 1986. Stochastic complexity and modelling. Annals of Statistics 14(3), 1080–1100. de Ruyter van Steveninck, R. R., Bialek, W., Potters, M., Carlson, R. H., and Lewen, G. D. 1995. Adaptive movement computation by the blowfly visual system. Princeton Lectures in Biophysics. Wallace, C. S., and Freeman, P. R. 1987. Estimation and inference by compact coding. Journal of the Royal Statistical Society 49(3), 240–265. Yamanishi, K. 1995. A decision-theoretic extension of stochastic complexity and its applications to learning. Submitted for publication.
Received January 16, 1996; accepted April 24, 1996.
Communicated by Steven Nowlan
Bias/Variance Analyses of Mixtures-of-Experts Architectures

Robert A. Jacobs
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627 USA
This article investigates the bias and variance of mixtures-of-experts (ME) architectures. The variance of an ME architecture can be expressed as the sum of two terms: the first term is related to the variances of the expert networks that comprise the architecture and the second term is related to the expert networks' covariances. One goal of this article is to study and quantify a number of properties of ME architectures via the metrics of bias and variance. A second goal is to clarify the relationships between this class of systems and other systems that have recently been proposed. It is shown that in contrast to systems that produce unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated.

1 Introduction

Many researchers recently have studied statistical models that estimate the value of a random variable by combining the estimates of other models, henceforth referred to as experts. Theoretical and experimental results have established that when the experts are unbiased estimators, combination procedures are most effective when the experts' estimates are negatively correlated; they are moderately effective when the experts are uncorrelated and only mildly effective when the experts are positively correlated.

As an example of an analysis of combinations that use unbiased experts, Clemen and Winkler (1985) quantified the utility of positively correlated, uncorrelated, and negatively correlated experts when the experts' estimates are combined via Bayes' rule. These investigators represented the statistical dependence among the experts by the dependence among their estimation errors and then considered the number of independent experts that carry the same amount of information as a given number of dependent experts. Given certain assumptions, Clemen and Winkler showed that the number of independent experts that are worth the same as an infinite number of positively correlated experts is equal to the inverse of the correlation among the experts. This limit is surprisingly low. If, for instance, the correlation among the experts is 0.5, then an infinite number of such experts carry as much information as only two independent experts. After the first expert, which is worth one independent expert, all other experts combined are worth only one additional independent expert.
The opposite situation is found when negative dependence among the experts is considered. That is, whereas positive dependence among the experts diminishes the effectiveness of a combination procedure, negative dependence increases it.

As a consequence of these types of theoretical results, as well as of empirical studies that show that these general findings also hold in practice (Perrone 1993; Hashem 1993), several researchers have investigated architectures and training procedures for obtaining unbiased experts whose estimation errors are uncorrelated (Meir 1994; Raviv and Intrator, in press; Tresp and Taniguchi 1995). In contrast, this article studies a class of architectures, known as mixtures-of-experts (ME) architectures, that adopt a very different strategy. The analyses are conducted by estimating the bias and variance of these models under a variety of conditions. Based on a result due to Meir (1994), it is shown that the architecture's variance can be expressed as the sum of two terms: the first related to the variances of the experts that comprise the architecture and the second related to the experts' covariances.

One goal of this article is to study and quantify a number of properties of ME architectures via the metrics of bias and variance. Because these systems consist of several interacting components, their performance properties can be difficult to understand in a rigorous way. The analyses presented here are useful in this regard. A second goal is to clarify the relationships between this class of architectures and other architectures that have recently been proposed. Instead of producing unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated.

The article is organized as follows. Section 2 briefly overviews the ME architecture and its associated learning procedure. Section 3 presents the equations for computing the bias and variance of these architectures. Section 4 presents the results of estimating the biases and variances of ME architectures operating with different numbers of components, operating with different parameter settings, and operating in different noise environments.

2 Mixtures-of-Experts Architectures

The architectures studied in this article are members of the ME family of architectures. This family is of interest on both theoretical and empirical grounds. From a theoretical viewpoint, the architectures combine aspects of finite mixture models and generalized linear models, two well-studied statistical frameworks (Jordan and Jacobs 1994; McCullagh and Nelder 1989; McLachlan and Basford 1988). From an empirical viewpoint, they have been shown to be capable of comparatively fast learning and good generalization on a wide variety of regression and classification tasks (Jacobs et al. 1991; Jordan and Jacobs 1994; Nowlan and Hinton 1991; Waterhouse and Robinson 1994).
ME architectures are multinetwork, or modular, architectures that combine aspects of competitive and associative learning. Different architectures within this framework may be formed by placing the networks in different structural arrangements. A probabilistic interpretation exists such that for each arrangement of networks, there is a corresponding likelihood function that characterizes an architecture's performance. Learning occurs by maximizing the likelihood function. ME architectures attempt to solve problems using a "divide-and-conquer" strategy; that is, complex problems are decomposed into a set of simpler subproblems. It is assumed that the data can be adequately summarized by a collection of functions, each defined over a local region of the input space. ME architectures adaptively partition the input space into possibly overlapping regions and allocate different networks to summarize the data located in different regions.

This section briefly overviews the ME architectures used in this article. More extensive discussions of ME architectures can be found in Jacobs et al. (1991), Jacobs and Jordan (1991, 1993), Jordan and Jacobs (1994), Jordan and Xu (1993), Nowlan and Hinton (1991), and Peng et al. (1996).

For the purposes of this article, it is assumed that the data are generated by a number of different probabilistic rules. Let D = {(x^(t), y^(t))} denote the collection of training data, where x is an input vector, y is a target output, and t is a time index. At each time step, a rule is selected from a conditional multinomial distribution with probability g_i^(t) = p(i|x^(t), V), where i indexes a rule and V = [v_1, ..., v_I] is the matrix of parameters underlying the distribution. The selected rule generates an output y with probability p(y^(t)|x^(t), U_i, Φ_i), where U_i is a parameter matrix and Φ_i represents other (possibly nuisance) parameters.

In the case of regression, each rule is characterized by a statistical model of the form y = f_i(x) + ε_i, where f_i(x) is a fixed linear function of the input vector, and ε_i is a random variable. If it is assumed that ε_i is gaussian with a mean of zero and a variance of σ_i², and if it is assumed that the data are independent and identically distributed, then the likelihood of generating the data is proportional to the finite mixture density

    L(\Theta|D) \propto \prod_t \sum_i p(i|x^{(t)}, V) \, p(y^{(t)}|x^{(t)}, U_i, \Phi_i)    (2.1)
               \propto \prod_t \sum_i g_i^{(t)} \frac{1}{\sigma_i} e^{-\frac{1}{2\sigma_i^2} \left( y^{(t)} - \mu_i^{(t)} \right)^2} ,    (2.2)
where µ_i^(t) = f_i(x^(t)), and Θ = [v_1, ..., v_I, U_1, ..., U_I, Φ_1, ..., Φ_I]^T is the matrix of all parameters.

The ME architectures studied here consist of two types of networks: a gating network and a number of expert networks. The gating network models the input-dependent multinomial distribution used to select a rule.
The expert networks model the input-dependent statistical models associated with the different rules. At each time step, all networks receive the vector x as input. The output of expert network i, denoted µ_i, is a linear function of its input. The outputs of the gating network are computed using the "softmax" function (Bridle 1989); specifically, the activation of the ith output unit of the gating network, denoted g_i, is

    g_i = \frac{e^{s_i/\tau}}{\sum_{j=1}^{I} e^{s_j/\tau}} ,    (2.3)
where τ is a temperature parameter, s_i denotes the weighted sum of unit i's inputs, and I denotes the number of expert networks. The output of the architecture as a whole, given by

    \mu = \sum_{i=1}^{I} g_i \mu_i ,    (2.4)
is a convex combination of the experts' outputs. The parameters of the gating and expert networks are adapted so as to maximize the likelihood function given in equation 2.2. This maximization is performed using the expectation-maximization (EM) algorithm (see Jordan and Jacobs 1994 for the EM equations).

3 Bias/Variance Measures

The expected squared error of an estimator on a particular data item may be expressed as the sum of two terms,

    E[(y - \mu)^2 \mid x, D] = E[(y - E[y|x])^2 \mid x, D] + (\mu - E[y|x])^2 ,    (3.1)
where x is the input vector, y is the target output, µ is the estimator's approximation, D is the set of training data used to train the estimator, and E is the expectation operator taken with respect to the probability distribution p(y|x). Because the first term does not depend on the data, only the second term is considered below. The expected value of the second term with respect to the data can also be written as the sum of two terms,

    E_D[(\mu - E[y|x])^2] = (E_D[\mu] - E[y|x])^2 + E_D[(\mu - E_D[\mu])^2] ,    (3.2)
where E_D is the expectation operator taken with respect to the data. The first term is the square of the bias of an estimator, and the second term is the estimator's variance.
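Equation 3.2's decomposition can be estimated empirically by retraining an estimator on many independently drawn training sets, in the style of Geman et al. (1992). The sketch below is a minimal illustration of that recipe for a deliberately simple estimator; the target function, noise level, and estimator are my choices, not those used in this article.

```python
# Sketch: estimating the squared bias and variance of equation 3.2 for
# a simple estimator (a degree-1 polynomial fit to noisy sine data).
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 200)
true = np.sin(2 * np.pi * x_test)

n_sets, n_train, sigma = 100, 50, 0.3
preds = np.empty((n_sets, x_test.size))
for n in range(n_sets):
    x = rng.uniform(0, 1, n_train)
    y = np.sin(2 * np.pi * x) + rng.normal(0, sigma, n_train)
    coeffs = np.polyfit(x, y, deg=1)          # the estimator mu(x; D)
    preds[n] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)                # E_D[mu]
bias_sq = np.mean((mean_pred - true) ** 2)    # first term of eq. 3.2
variance = np.mean((preds - mean_pred) ** 2)  # second term of eq. 3.2
print(bias_sq, variance)
```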
ME architectures produce an estimate of a target output by linearly combining the estimates of several experts. Consequently, the bias and variance of an ME architecture can be written in terms of the gating network outputs (the linear coefficients) and the expert network outputs. This leads to an interesting decomposition in the case of the variance of an ME architecture. Based on a result due to Meir (1994), the variance may be expressed as the sum of two terms,

    E_D[(\mu - E_D[\mu])^2] = E_D\left[ \sum_i (g_i\mu_i - E_D[g_i\mu_i])^2 \right] + E_D\left[ \sum_i \sum_{j \ne i} (g_i\mu_i - E_D[g_i\mu_i])(g_j\mu_j - E_D[g_j\mu_j]) \right] ,    (3.3)
where the first term is the variance of the weighted outputs of the individual experts, and the second term is the covariance of the experts' weighted outputs. Combining equations 3.2 and 3.3, the expected squared difference between the estimate of an ME architecture and the expected value of a regression function is the sum of the architecture's squared bias and the variance and covariance of the experts' weighted outputs. It is possible to gain insight into the performance characteristics of ME architectures by analyzing these systems in terms of these quantities.

4 Simulation Results

This section reports the results of estimating the bias of ME architectures and the variance and covariance of experts' weighted outputs on a regression task. The regression function is

    f(x) = \frac{1}{13}\left[ 10 \sin(\pi x_1 x_2) + 20 \left( x_3 - \frac{1}{2} \right)^2 + 10 x_4 + 5 x_5 - 1 \right] ,    (4.1)
where x = [x1 , . . . , x5 ] is an input vector whose components lie between zero and one. The value of f (x) lies in the interval [−1, 1]. Target outputs are created by adding noise sampled from a gaussian distribution with a mean of zero and a variance of σ 2 to the function f . This regression task is a linearly scaled version of a task that has been used by other investigators to evaluate statistical estimators (e.g., Friedman 1991; Friedman et al. 1983). Twenty-five training sets were created. Each set consisted of 500 inputoutput patterns in which the components of the input vectors were independently sampled from a uniform distribution over the interval (0, 1). A test set of 1024 input-output patterns was also created. For this set, the input vectors were uniformly spaced in the five-dimensional input space, and the target
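Equation 4.1 and the data-generation procedure described in the next paragraph can be transcribed directly; the sketch below is mine, with illustrative parameter names.

```python
# Sketch: generating training data for the scaled Friedman task of
# equation 4.1 (500 patterns, uniform inputs, gaussian target noise).
import numpy as np

def friedman_scaled(x):
    """Equation 4.1; x has shape (n, 5) with entries in (0, 1)."""
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4] - 1) / 13.0

rng = np.random.default_rng(0)

def make_training_set(n=500, noise_var=0.1):
    x = rng.uniform(0, 1, size=(n, 5))
    y = friedman_scaled(x) + rng.normal(0, np.sqrt(noise_var), size=n)
    return x, y

x_train, y_train = make_training_set()
```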
outputs were not corrupted by noise. Twenty-five simulations of each architecture were conducted. In each simulation, the architecture was trained on a different training set. Simulations lasted for 60 epochs, and architectures were evaluated on the test set at every fifth epoch. The variance parameter associated with each expert network was set to the variance of the noise added to the training data. An architecture was initialized with the same set of small, random weights at the start of each simulation. Consequently, different simulations of an architecture yielded different performances solely due to the use of different training sets. The equations for estimating the quantities of interest are closely related to the equations for estimating the bias and variance of neural networks presented by Geman et al. (1992). The average output of an architecture on pattern t from the test set, denoted µ(t) , is given by µ(t) =
N 1 X µ(t,n) , N n=1
(4.2)
where µ^(t,n) is the output on the nth simulation and N is the number of simulations. Similarly, the average weighted output of expert network i on pattern t, denoted \overline{g_i^{(t)} \mu_i^{(t)}}, is given by

    \overline{g_i^{(t)} \mu_i^{(t)}} = \frac{1}{N} \sum_{n=1}^{N} g_i^{(t,n)} \mu_i^{(t,n)} ,    (4.3)
where g_i^(t,n) is the activation of the ith unit of the gating network on simulation n, and µ_i^(t,n) is the output of expert network i on simulation n. Define the integrated bias, meaning the squared bias averaged over the set of input-output patterns in the test set, to be

    \text{Integrated bias} \equiv \frac{1}{T} \sum_{t=1}^{T} \left| \bar{\mu}^{(t)} - y^{(t)} \right|^2 ,    (4.4)
where y^(t) is the target output on pattern t and T is the total number of patterns. The integrated variance and integrated covariance of the experts' weighted outputs are defined in analogous ways:

    \text{Integrated variance} \equiv \frac{1}{T} \sum_{t=1}^{T} \sum_i \frac{1}{N} \sum_{n=1}^{N} \left( g_i^{(t,n)} \mu_i^{(t,n)} - \overline{g_i^{(t)} \mu_i^{(t)}} \right)^2    (4.5)

    \text{Integrated covariance} \equiv \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{n=1}^{N} \sum_i \sum_{j \ne i} \left( g_i^{(t,n)} \mu_i^{(t,n)} - \overline{g_i^{(t)} \mu_i^{(t)}} \right) \left( g_j^{(t,n)} \mu_j^{(t,n)} - \overline{g_j^{(t)} \mu_j^{(t)}} \right) .    (4.6)
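Assuming arrays of saved simulation outputs, equations 4.2 through 4.6 reduce to a few lines of array arithmetic. The shapes and names below are mine, chosen for illustration:

```python
# Sketch: integrated bias, variance, and covariance (eqs. 4.2-4.6).
# Shapes assumed: mu[n, t], g[n, t, i], mu_i[n, t, i] over N simulations,
# T test patterns, and I experts; y[t] holds the noise-free targets.
import numpy as np

def integrated_metrics(mu, g, mu_i, y):
    weighted = g * mu_i                          # g_i mu_i, shape (N, T, I)
    mean_out = mu.mean(axis=0)                   # average output, eq. 4.2
    mean_weighted = weighted.mean(axis=0)        # eq. 4.3, shape (T, I)
    bias = np.mean((mean_out - y) ** 2)          # eq. 4.4
    dev = weighted - mean_weighted               # deviations, (N, T, I)
    variance = np.mean(np.sum(dev ** 2, axis=2)) # eq. 4.5
    total = np.sum(dev, axis=2) ** 2             # (sum_i dev_i)^2
    covariance = np.mean(total - np.sum(dev ** 2, axis=2))  # eq. 4.6
    return bias, variance, covariance

# Tiny synthetic check: N = 3 simulations, T = 4 patterns, I = 2 experts.
rng = np.random.default_rng(0)
mu_i = rng.normal(size=(3, 4, 2)); g = np.full((3, 4, 2), 0.5)
mu = (g * mu_i).sum(axis=2); y = np.zeros(4)
print(integrated_metrics(mu, g, mu_i, y))
```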
Figure 1: Integrated bias, variance, covariance, and mean-squared error of architectures with 4, 8, or 12 expert networks.
It is also useful to define the integrated mean-squared error (MSE):

    \text{Integrated MSE} \equiv \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{n=1}^{N} \left| \mu^{(t,n)} - y^{(t)} \right|^2 .    (4.7)
Equivalently, the integrated MSE can be expressed as the sum of the integrated bias, variance, and covariance.

The first experiment that we conducted evaluated ME architectures with different numbers of expert networks. Systems with 4, 8, and 12 experts were used. The temperature parameter in the softmax activation function used by the gating network (see equation 2.3) was set to 1, and the variance of the noise added to the target function was set to 0.1. The results are presented in Figure 1. The training time is given by the horizontal axis of each graph in this figure. The integrated bias is given by the vertical axis of the upper-left graph; the integrated variance is given by the vertical axis of the upper-right graph; the lower-left graph gives the integrated covariance; and the lower-right graph gives the integrated MSE. In all four graphs, the solid line gives the results for the architecture with 12 experts; the dense dashed line and the sparse dashed line give the results for the architectures with 8 and 4 experts, respectively (a legend is shown in the upper-left graph).

As expected, the integrated bias declined with training time for all architectures. The 12-expert and 8-expert systems have more computational resources than the 4-expert system and thus achieved smaller bias.
Figure 2: Correlations among the expert networks’ outputs for architectures with 4, 8, and 12 experts at epochs 20 and 60.
The integrated variance for all architectures increased with training time. This is particularly true for the system with 12 experts. In contrast, the covariance for all architectures decreased as training progressed. Because the variance increased significantly faster than the covariance decreased, the systems eventually overfit the training data as evidenced by the integrated MSE. Overfitting was most severe for the 12-expert system.

For our purposes, it is important to note that the integrated covariance became negative during the course of training. Although this covariance is based on the expert networks' weighted outputs (as weighted by the gating network), its negative value suggests that it is likely that the experts' unweighted outputs also became negatively correlated. In order to evaluate this hypothesis, we measured the correlations among these outputs (averaged over the 25 simulations of each architecture) at epoch 20 (prior to overfitting) and at epoch 60 (overfitting has occurred). The results are shown in Figure 2. The three rows of bar graphs correspond to the systems with 4, 8, and 12 expert networks; the two columns correspond to epochs 20 and 60. The vertical axis of each graph gives the value of a correlation. A correlation is represented by a vertical bar with one end point at zero and the other end point at the value of the represented correlation.
Figure 3: Squared biases of the individual expert networks that comprise architectures with 4, 8, and 12 experts at epochs 20 and 60.
The results suggest that the ME training procedure leads to negatively correlated experts. In addition, the correlations become more negative as training proceeds. The negative correlations stem from the fact that each ME system adaptively partitions the input space into regions such that the target function has different properties in each region. Different experts learn the data located in different regions. Because the correlations for the 4-expert system, which has relatively few computational resources, were all negative, it may be said that the ME architecture was efficient in the sense that it adapted the experts so that different experts provided informationally different "basis" functions. Although some of the correlations of the 8- and 12-expert systems were positive in value, the majority of the correlations were negative.

The ME architecture tends to produce negatively correlated experts and thus should be preferable to systems that produce uncorrelated and unbiased experts only if its experts are also unbiased. To evaluate this possibility, we estimated the squared bias of the individual experts. The results are shown in Figure 3. The bar graphs in this figure correspond to the systems with 4, 8, and 12 expert networks evaluated at epochs 20 and 60. The vertical axis of each graph gives the estimated bias of an individual expert, and different bars correspond to different experts. Despite the fact that the bias of an ME architecture was nearly zero, the squared biases of the individual experts comprising the architecture were positive. The biases of experts in larger architectures were generally greater than those of experts in smaller architectures. Furthermore, the expert biases increased with additional training.

The experiments reported here help clarify the relationships between the performance characteristics of ME architectures and those of some systems studied by other researchers. As noted above, several investigators are seeking training methods for achieving unbiased experts whose estimation errors are uncorrelated (Meir 1994; Raviv and Intrator, in press; Tresp and Taniguchi 1995). As just one example, Raviv and Intrator (in press) use a noisy sampling technique (bootstrapping) in order to create sets of independent training samples. Uncorrelated experts are produced by training different experts on different training sets. In contrast, ME architectures and their associated training procedures adopt a very different strategy: they tend to produce experts that are biased and negatively correlated.

The next experiment studied the effects of varying the temperature parameter in the softmax function used by the gating network. Jordan and Jacobs (1994) claimed that an advantage of ME architectures over other tree-structured statistical estimators, such as CART (Breiman et al. 1984) or C4.5 (Quinlan 1993), is that they use soft splits of the input space instead of hard splits, meaning that regions of the input space defined by the architecture can overlap and that data items can lie simultaneously in multiple regions. Estimators that use hard splits are likely to have a larger variance than estimators that use soft splits. In order to verify and quantify this claim, we compared the performance of an ME architecture when the temperature was small (τ = 0.025) to that when the temperature was large (τ = 1.0). The splits of an architecture with a near-zero temperature are more like hard splits, especially during the early stages of training when the gating network's weights are small. The architecture that was used had 8 expert networks. The variance of the noise added to the target function was 0.1.

The results are presented in Figure 4 (this figure has the same format as Fig. 1). The solid line gives the performance of the architecture with a large temperature; the dashed line is for the case of a small temperature. The integrated bias of the large-temperature architecture decreased more slowly than that of the small-temperature architecture during the early stages of training but eventually reached a lower value. Its integrated variance was significantly lower, and its integrated covariance was only moderately higher. Thus, these outcomes are consistent with the claim of Jordan and Jacobs (1994). The MSE of the large-temperature system was smaller. Overall, the results suggest that the system with a low temperature tended to commit early in the training process to a particular partition of the input space.
Figure 4: The integrated bias, variance, covariance, and mean-squared error for architectures when the temperature parameter τ was set to 1.0 or 0.025.
Because the architecture at this stage of training is highly sensitive to the idiosyncrasies of both its training data and its initial weight settings, the selected partitions frequently lead to poor generalization performance.

We have also analyzed architectures when the variance of the noise added to the target function was varied. Large noise (σ² = 0.2), moderate noise (σ² = 0.1), and small noise (σ² = 0.05) conditions were studied. The ME architecture had 8 expert networks. The temperature τ was set to 1. The results are shown in Figure 5. The solid line is for the large noise condition; the dense and sparse dashed lines correspond to the moderate noise and small noise conditions, respectively. As expected, the integrated bias of the architecture decreased more slowly when the noise was relatively large. The architecture's integrated variance grew slowly during early stages of training and more rapidly during later stages under this condition. Similarly, its covariance decreased slowly initially but eventually decreased rapidly in the large noise case. The integrated MSE was smallest for the low noise case. In sum, performance was best when the signal-to-noise ratio of the training data was high and degraded gracefully with decreasing values of this ratio.

Figure 5: The integrated bias, variance, covariance, and mean-squared error for architectures when the variance of the noise added to the target function was 0.2, 0.1, or 0.05.

5 Conclusions

This article has investigated the bias and variance of several ME architectures. It was shown that the variance of an ME architecture can be expressed as the sum of two terms: the first is related to the variances of the experts that comprise the architecture; the second is related to the experts' covariances.
One goal of this article was to study and quantify a number of properties of ME architectures. A second goal was to clarify the relationships between this class of systems and other systems that have recently been proposed. In contrast to systems that produce unbiased experts whose estimation errors are uncorrelated, ME architectures produce biased experts whose estimates are negatively correlated.

The fact that ME architectures produce biased experts may be seen as a disadvantage by some readers. My view is that this property is of no consequence so long as the overall ME architecture is unbiased at the end of training. This appears to be the case, as is evidenced in Figures 1, 4, and 5. Nonetheless, it may be tempting to seek ways to modify the architecture or training procedure so as to produce individual experts with less bias. The most obvious possibility is to add hidden units to the expert networks so that these networks could perform nonlinear regressions. Although this practice would reduce the bias of the individual experts, I do not necessarily recommend it. One drawback of adding hidden units to the experts is that the EM algorithm, which is extremely efficient for optimizing likelihood functions, could no longer be easily used during training. A second drawback is that adding hidden units would increase the variances of the individual experts. The statistical community has generally regarded nonzero variance as a more serious problem than nonzero bias, as evidenced
by the community's focus on the overfitting problem. It is often found that the addition of a small amount of bias to an estimator can result in a larger decrease in the estimator's variance.

The analyses presented here help in understanding a number of regularization procedures that have recently been applied to ME architectures in order to ameliorate the problem of overfitting. One way that overfitting can be lessened is through the use of methods that decrease the variances of the experts. Waterhouse et al. (1996) pursued this approach in a Bayesian framework by assigning prior distributions to the weights of each expert. The parameter values for each prior distribution were estimated from the data. This strategy aims to decrease the experts' variances without overly increasing the architecture's bias. Simulation results on two artificial data sets and on a sunspot prediction task showed that the method can be effective in eliminating overfitting. The experts' variances can also be decreased by methods that reduce the number of free parameters in an ME architecture. Jacobs et al. (1996) used Bayesian sampling techniques in order to detect and eliminate unnecessary expert networks. Simulation results on a speech recognition task and a breast cancer classification task showed that the method led to improved generalization performances. Waterhouse and Robinson (1996) proposed an algorithm that both adds and deletes networks from an ME architecture during the course of training. A different approach to ameliorating overfitting is to increase the degree to which experts are negatively correlated. Although we are not aware of any studies that have pursued this approach, simulation results of Jacobs and Kosslyn (1994) suggest that it might be achieved through the use of expert networks that each have a different topology or each receive a different set of input variables.

Acknowledgments

I thank M. Jordan for commenting on an earlier version of this article. This work was conducted while I was a visiting scholar at Stanford University. I am grateful to D. Rumelhart and the members of his research laboratory for many interesting discussions. This work was supported by NIH research grant R29-MH54770.

References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth International Group, Belmont, CA.
Bridle, J. 1989. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing: Algorithms, Architectures, and Applications, F. Fogelman-Soulie and J. Hérault, eds. Springer-Verlag, New York.
Clemen, R. T., and Winkler, R. L. 1985. Limits for the precision and value of information from dependent sources. Operations Research 33, 427–442.
Friedman, J. H. 1991. Multivariate adaptive regression splines. Annals of Statistics 19, 1–141.
Friedman, J. H., Grosse, E., and Stuetzle, W. 1983. Multidimensional additive spline approximation. SIAM Journal of Scientific Computing 4, 291–301.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation 4, 1–58.
Hashem, S. 1993. Optimal linear combinations of neural networks. Tech. Rep. SMS 94-4, School of Industrial Engineering, Purdue University.
Jacobs, R. A., and Jordan, M. I. 1991. A competitive modular connectionist architecture. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Jacobs, R. A., and Jordan, M. I. 1993. Learning piecewise control strategies in a modular neural network architecture. IEEE Transactions on Systems, Man, and Cybernetics 23, 337–345.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Computation 3, 79–87.
Jacobs, R. A., and Kosslyn, S. M. 1994. Encoding shape and spatial relations: The role of receptive field size in coordinating complementary representations. Cognitive Science 18, 361–386.
Jacobs, R. A., Peng, F., and Tanner, M. A. 1996. A Bayesian approach to model selection in hierarchical mixtures-of-experts architectures. Neural Networks. In press.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures-of-experts and the EM algorithm. Neural Computation 6, 181–214.
Jordan, M. I., and Xu, L. 1993. Convergence results for the EM approach to mixtures-of-experts architectures. Tech. Rep. 9303, Department of Brain and Cognitive Sciences, MIT.
McCullagh, P., and Nelder, J. A. 1989. Generalized Linear Models. Chapman and Hall, London.
McLachlan, G. J., and Basford, K. E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
Meir, R. 1994. Bias, variance, and the combination of estimators: The case of linear least squares. Tech. Rep. 922, Department of Electrical Engineering, Technion, Haifa, Israel.
Nowlan, S. J., and Hinton, G. E. 1991. Evaluation of adaptive mixtures of competing experts. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Peng, F., Jacobs, R. A., and Tanner, M. A. 1996. Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, in press.
Perrone, M. P. 1993. Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Ph.D. thesis, Brown University.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Raviv, Y., and Intrator, N. In press. Bootstrapping with noise: An effective regularization technique.
Tresp, V., and Taniguchi, M. 1995. Combining estimators using non-constant weighting functions. In Advances in Neural Information Processing Systems 7, G. Tesauro, D. S. Touretzky, and T. K. Leen, eds. MIT Press, Cambridge, MA.
Waterhouse, S. R., MacKay, D., and Robinson, T. 1996. Bayesian methods for mixtures of experts. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds. MIT Press, Cambridge, MA.
Waterhouse, S. R., and Robinson, A. J. 1994. Classification using hierarchical mixtures of experts. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, J. Vloutzos, J.-N. Hwang, and E. Wilson, eds. IEEE Press, New York.
Waterhouse, S. R., and Robinson, A. J. 1996. Constructive algorithms for hierarchical mixtures of experts. In Advances in Neural Information Processing Systems 8, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, eds. MIT Press, Cambridge, MA.
Received September 5, 1995; accepted April 24, 1996.
Communicated by Ramamohan Paturi
The Behavior of Forgetting Learning in Bidirectional Associative Memory

Chi Sing Leung, Lai Wan Chan
Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Forgetting learning is an incremental learning rule in associative memories. With it, recent learning items can be encoded, and old learning items are gradually forgotten. In this article, we analyze the storage behavior of bidirectional associative memory (BAM) under forgetting learning; that is, we ask, "Can the k most recent learning items be stored as fixed points?" Also, we discuss how to choose the forgetting constant in the forgetting learning such that the BAM can correctly store as many of the most recent learning items as possible. Simulation is provided to verify the theoretical analysis.

1 Introduction

Associative memory is a major class of neural networks with a wide range of applications, such as content-addressable memory and pattern recognition (Kohonen 1972; Palm 1980). An important feature of associative memory is its associative nature, that is, the ability to recall a stored item based on partial or noisy information. Associative memory encodes and recalls library patterns or pattern pairs. If it encodes single patterns, it is called autoassociative memory. If it encodes pattern pairs, it is called heteroassociative memory. One form of heteroassociative memory is the bivalent additive bidirectional associative memory (BAM) (Kosko 1988). The BAM is used to store bipolar library pairs (X_h, Y_h), where X_h = (x_{1h}, ..., x_{nh})^T, Y_h = (y_{1h}, ..., y_{ph})^T, h = 1, ..., m, and m is the number of library pairs. The components x_{ih} and y_{jh} are bipolar: either +1 or −1. There are two layers in the BAM: layer F_X and layer F_Y, with n and p neurons, respectively. The connection matrix W, proposed by Kosko, is
\[
W = \sum_{h=1}^{m} Y_h X_h^T . \tag{1.1}
\]
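In code, the Kosko connection matrix is just a sum of outer products. A minimal numpy sketch (names are illustrative):

```python
import numpy as np

def kosko_matrix(X, Y):
    """Connection matrix W = sum_h Y_h X_h^T (cf. equation 1.1).
    X: (m, n) array of library inputs; Y: (m, p) array of library outputs."""
    return Y.T @ X   # equivalent to summing the m outer products
```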
The retrieval process is an iterative feedback process that starts with a stimulus vector $X^{(0)}$ in $F_X$:
\[
Y^{(v+1)} = \operatorname{sgn}\left[ W X^{(v)} \right], \qquad
X^{(v+1)} = \operatorname{sgn}\left[ W^T Y^{(v+1)} \right], \tag{1.2}
\]
where sgn is applied componentwise and defined as
\[
\operatorname{sgn}(x) = \begin{cases} +1 & x > 0 \\ -1 & x < 0 \\ \text{state unchanged} & x = 0 . \end{cases}
\]
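As an illustration of this retrieval dynamics, the following minimal numpy sketch (names are illustrative) iterates equation 1.2 until a fixed point is reached:

```python
import numpy as np

def sgn(x, prev):
    """Componentwise sign; components with x == 0 keep their previous state."""
    return np.where(x > 0, 1, np.where(x < 0, -1, prev))

def bam_recall(W, x0, max_iters=100):
    """Iterate Y = sgn(W X), X = sgn(W^T Y) from stimulus x0 (cf. equation 1.2)."""
    x = x0.copy()
    y = np.ones(W.shape[0], dtype=int)   # arbitrary initial state for F_Y
    for _ in range(max_iters):
        y_new = sgn(W @ x, y)
        x_new = sgn(W.T @ y_new, x)
        if np.array_equal(x_new, x) and np.array_equal(y_new, y):
            break                        # (x, y) is a fixed point
        x, y = x_new, y_new
    return x, y
```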
Kosko proved that for any real connection matrix, a fixed point $(X_f, Y_f)$ can be obtained from this iterative process. Ideally, such a fixed point is one of the library pairs. A fixed point has the following properties:
\[
Y_f = \operatorname{sgn}(W X_f) \quad \text{and} \quad X_f = \operatorname{sgn}(W^T Y_f). \tag{1.3}
\]
Hence, a library pair can be retrieved only if it is a fixed point. With the iterative process, the BAM can achieve both heteroassociative and autoassociative data recollections. The final state in F_X represents the autoassociative recall; the final state in F_Y represents the heteroassociative recall. The storage behavior of the BAM under Kosko's encoding method and its variants has been studied by various researchers (Haines and Nielsen 1988; Leung et al. 1995; Leung 1993; Wang et al. 1990).

An advantage of Kosko's encoding method is the ability of incremental learning: encoding new library pairs into the model based on only the current connection matrix. However, with Kosko's encoding method, the BAM can only correctly store up to min(n, p)/(2 log min(n, p)) library pairs. When the total number of library pairs exceeds that value, all library pairs, including the old and new items, may not be stored as fixed points. To avoid this, we may require that the BAM have the ability to forget; that is, the model should be able to create space for the new library pairs. In modeling the forgetting behavior of the BAM, we would like to meet three physiological requirements:

1. Locality. Each weight should be updated based on the values of the corresponding connected neurons.
2. Being incremental. The update rule is based on only the current connection matrix and is independent of which library pairs have been stored.
3. Boundedness. The magnitude of the weights should be bounded.

To achieve these requirements, we introduce a decay factor α_f ∈ (0, 1), called the forgetting constant, into Kosko's original encoding method:
\[
W^{(t)} = \alpha_f W^{(t-1)} + Y_t X_t^T , \tag{1.4}
\]
where W^{(0)} is a zero matrix and (X_t, Y_t) is the new library pair that we want to encode into the BAM. Equation 1.4 represents a simple form of forgetting learning. It is the discrete-time version of the original continuous-time forgetting rule in Kosko (1992). With the decay factor, the forgetting learning can encode the recent library pairs as fixed points with a high chance and can delete some old stored library pairs from the model. Also, with our suggestion on α_f (see Section 3), the magnitude of the weights is upper bounded by O(n/log n) in the worst case and by O(√n) in the probabilistic sense (see Section 4). However, similar to some conventional backpropagation rules, real-valued computations are involved in equation 1.4. The hardware implementation of equation 1.4 is more difficult than that of Kosko's encoding scheme. We will use simulation to investigate the behavior of equation 1.4 under the fixed-point computation machine described in Section 3.

The storage behavior of a Hopfield network under some similar forgetting rules was studied numerically or theoretically by Nadal et al. (1986), Hemmen et al. (1988), and Parisi (1986). They showed that the previous λn library patterns can be stored in a Hopfield network when a small number of errors are allowed in the recalled patterns, where n is the number of neurons in the Hopfield network and λ is a positive constant less than one. (For a brief review, see Hemmen and Kuhn 1991.) The learning within bounds (Hopfield 1982) was proposed for the Hopfield network and studied numerically by Parisi (1986). Its updating rule is
\[
W^{(t)} = \phi\left( W^{(t-1)} + X_t X_t^T \right), \tag{1.5}
\]
where φ(x) = x for |x| < A√n and φ(x) = sgn(x) A√n otherwise. Since we are now discussing the Hopfield model, the second term on the right-hand side of equation 1.5 is the outer product of the library pattern with itself. Let k₀ be the number of most recently stored patterns that are satisfactorily kept in the memory. Parisi (1986) numerically showed that k₀ can be proportional to n when a small number of errors are allowed in the retrieved items. Using the nonrigorous replica method, Hemmen et al. (1988) showed that with a suitable constant A (≈ 0.35) the most recent 0.04n library patterns can be kept in the memory when a small number of errors are allowed in the retrieved items. Since learning within bounds involves only integer-valued weights, its hardware implementation is easier than that of equation 1.4. However, the replica method for averaging over the disorder in the system due to the different possible realizations of nominal patterns is strictly valid only for a temperature T₀ > 0 (Forrest and Wallace 1991). In fact, the "T₀ = 0" in Hemmen et al. (1988) violates this criterion. It should also be noted that for the BAM, learning within bounds can be used only when p = n. To our knowledge, for the BAM the rule for selecting the constant A in the learning within bounds has not been reported for p ≠ n.
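To make the two update rules concrete, here is a minimal numpy sketch (all names are illustrative) of the forgetting rule of equation 1.4 and, for contrast, the learning within bounds of equation 1.5:

```python
import numpy as np

def forgetting_update(W, x, y, alpha_f):
    """Forgetting learning for a BAM (cf. equation 1.4):
    decay the old matrix, then add the outer product of the new pair."""
    return alpha_f * W + np.outer(y, x)

def bounded_update(W, x, A):
    """Learning within bounds for a Hopfield network (cf. equation 1.5):
    add the autocorrelation term, then clip weights at +/- A*sqrt(n)."""
    n = len(x)
    bound = A * np.sqrt(n)
    return np.clip(W + np.outer(x, x), -bound, bound)
```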
In Nadal et al. (1986), the marginalist rule was proposed for the Hopfield network. Its updating rule is
\[
W^{(t)} = W^{(t-1)} + \delta^{\,t-1} X_t X_t^T , \tag{1.6}
\]
where δ is a constant larger than one. The concept of the marginalist rule is that the acquisition intensity for learning the last pattern should be greater than the background intensity. Hence, the last pattern can be stored with a high chance. Nadal et al. (1986) numerically showed that k₀ can also be proportional to n for the marginalist rule when a small number of errors are allowed in the retrieved patterns. However, the rule for choosing δ (such that the number of recent library items being nearly correctly stored is as large as possible) was not given in Nadal et al. (1986). Although equations 1.4 and 1.6 are similar after some mathematical manipulation, their effects on the magnitude of the weights are different. With equation 1.6, the magnitude increases exponentially. With equation 1.4, in contrast, the magnitude is upper bounded by O(n/log n) in the worst case and by O(√n) in the probabilistic sense. Note that equation 1.4 can also be written as the sum of the old matrix and an updating matrix:
\[
W^{(t)} = W^{(t-1)} + \Delta W^{(t)} , \tag{1.7}
\]
where $\Delta W^{(t)} = (\alpha_f - 1) W^{(t-1)} + Y_t X_t^T$.

Instead of using the classical nonrigorous replica method and allowing error in the retrieved pattern, our main goal is to study the storage behavior of the BAM under the simple forgetting rule (cf. equation 1.4) when error is not allowed in the retrieved pair. Also, we discuss the role of choosing the forgetting constant α_f. Under some assumptions, we will prove the following theorem in the next section.

Theorem 1. Under the forgetting learning (cf. equation 1.4), if a BAM is trained with t library pairs and
\[
k < \frac{\log \frac{(1-\alpha_f^2)\min(n,p)}{2\log\min(n,p)}}{2\log\frac{1}{\alpha_f}} , \tag{1.8}
\]
then the probability of the (t − k)th library pair (X_{t−k}, Y_{t−k}) being a fixed point tends to 1 as n → ∞, p → ∞.
By theorem 1, given α_f, we can determine the number of the most recent library pairs that the BAM can correctly store. Hence, the best value of α_f can be determined such that the BAM can correctly store as many of the most recent library pairs as possible.

In the next section, we analyze the storage behavior of the BAM under forgetting learning. We first prove theorem 1; from it, we can determine whether the kth most recent library pair, (X_{t−k}, Y_{t−k}), can be stored as a fixed point. We then discuss how to choose the forgetting constant such that
\[
\frac{\log \frac{(1-\alpha_f^2)\min(n,p)}{2\log\min(n,p)}}{2\log\frac{1}{\alpha_f}}
\]
is as large as possible (see corollary 1). In Section 3, simulations are given to verify the analysis. Section 4 addresses the magnitude of the weights in deterministic and probabilistic senses. A brief conclusion comprises Section 5.

2 Properties of Forgetting Learning

The following assumptions and notations are used:

• The dimensions n and p are large. Also, p = rn, where r is a positive constant.
• Each component of the library pairs (X_h, Y_h) is a ±1 equiprobable independent random variable.
• EU_{j,t−k} is the event that the jth component of sgn(W^{(t)} X_{t−k}) is equal to the jth component of Y_{t−k}; $\overline{EU}_{j,t-k}$ is the complement of EU_{j,t−k}.
• EV_{i,t−k} is the event that the ith component of sgn(W^{(t)T} Y_{t−k}) is equal to the ith component of X_{t−k}; $\overline{EV}_{i,t-k}$ is the complement of EV_{i,t−k}.

Lemma 1 (Chebyshev's inequality). For any random variable χ and u ≥ 0,
\[
\operatorname{Prob}(\chi \ge u) \le \inf_{\tau \ge 0} e^{-\tau u}\, E\!\left(e^{\tau \chi}\right).
\]

Lemma 2. Let $v_1, v_2, \ldots, v_N$ be independent ±1 equiprobable random variables and $S_N = \sum_{i=1}^{N} v_i$. Then
\[
E\!\left[\exp\!\left\{\tau \frac{S_N}{\sqrt{N}}\right\}\right] \le E\!\left[\exp\!\left\{\tau \hat{S}\right\}\right] = \exp\!\left\{\frac{\tau^2}{2}\right\}
\]
for τ > 0, where Ŝ is a standard normal variable. In general,
\[
E\!\left[\exp\!\left\{\tau \beta \frac{S_N}{\sqrt{N}}\right\}\right] \le \exp\!\left\{\frac{\beta^2 \tau^2}{2}\right\},
\]
where β is a real number.

Lemma 3. The probability $\operatorname{Prob}(\overline{EU}_{j,t-k})$ is less than
\[
\exp\!\left\{-\frac{(1-\alpha_f^2)\,\alpha_f^{2k}\, n}{2}\right\}
\]
for j = 1, ..., p.

Proof of Lemma 3. Without loss of generality, we consider that the library pair (X_{t−k}, Y_{t−k}) has all positive components; that is, X_{t−k} = (1, ..., 1)^T and Y_{t−k} = (1, ..., 1)^T. Then,
\[
\operatorname{Prob}(\overline{EU}_{j,t-k}) = \operatorname{Prob}\!\left( \sum_{h=1,\, h \ne t-k}^{t} \alpha_f^{t-h}\, \frac{S_{j,h}}{\sqrt{n}} > \alpha_f^{k} \sqrt{n} \right), \tag{2.1}
\]
where $S_{j,h} = y_{jh} \sum_{i=1}^{n} x_{ih}$. From lemma 2,
\[
E\!\left[\exp\!\left\{\tau \alpha_f^{t-h}\, \frac{S_{j,h}}{\sqrt{n}}\right\}\right] \le \exp\!\left\{\frac{\alpha_f^{2t-2h} \tau^2}{2}\right\}.
\]
By Chebyshev's inequality (cf. lemma 1),
\[
\operatorname{Prob}(\overline{EU}_{j,t-k})
\le \inf_{\tau \ge 0} \exp\{-\tau u\} \exp\!\left\{ \frac{\sum_{h=1,\, h \ne t-k}^{t} \alpha_f^{2t-2h}\, \tau^2}{2} \right\}
\le \inf_{\tau \ge 0} \exp\{-\tau u\} \exp\!\left\{ \frac{\sum_{h'=0}^{\infty} \alpha_f^{2h'}\, \tau^2}{2} \right\}
= \inf_{\tau \ge 0} \exp\{-\tau u\} \exp\!\left\{ \frac{\tau^2}{2(1-\alpha_f^2)} \right\},
\]
where $u = \alpha_f^k \sqrt{n}$. Clearly, the right-hand side is minimized at $\tau = u(1-\alpha_f^2)$. With this value of τ,
\[
\operatorname{Prob}(\overline{EU}_{j,t-k}) \le \exp\!\left\{-\frac{(1-\alpha_f^2)\,\alpha_f^{2k}\, n}{2}\right\}
\]
for j = 1, ..., p.

Lemma 4. The probability $\operatorname{Prob}(\overline{EV}_{i,t-k})$ is less than
\[
\exp\!\left\{-\frac{(1-\alpha_f^2)\,\alpha_f^{2k}\, p}{2}\right\}
\]
for i = 1, ..., n.

Proof of Lemma 4. The proof is similar to that of lemma 3.

With lemmas 3 and 4, theorem 1 can be proved in the following way. We denote the probability that (X_{t−k}, Y_{t−k}) is a fixed point as P*. Then,
\[
P^* = \operatorname{Prob}\left( EU_{1,t-k} \cap \cdots \cap EU_{p,t-k} \cap EV_{1,t-k} \cap \cdots \cap EV_{n,t-k} \right)
= 1 - \operatorname{Prob}\left( \overline{EU}_{1,t-k} \cup \cdots \cup \overline{EU}_{p,t-k} \cup \overline{EV}_{1,t-k} \cup \cdots \cup \overline{EV}_{n,t-k} \right)
\ge 1 - p \operatorname{Prob}\left( \overline{EU}_{1,t-k} \right) - n \operatorname{Prob}\left( \overline{EV}_{1,t-k} \right). \tag{2.2}
\]
One should be aware that
\[
P^* \ne \left( 1 - \operatorname{Prob}(\overline{EU}_{1,t-k}) \right)^p \left( 1 - \operatorname{Prob}(\overline{EV}_{1,t-k}) \right)^n \tag{2.3}
\]
because the events EU_{j,t−k} and EV_{i,t−k} are not mutually independent. With lemmas 3 and 4, it is easy to see that under the following condition the right-hand side of equation 2.2 tends to one as n tends to ∞ (and p → ∞, since p = rn with r a constant):
\[
k < \min\left\{ \frac{\log \frac{(1-\alpha_f^2)\, n}{2 \log n}}{2 \log \frac{1}{\alpha_f}},\; \frac{\log \frac{(1-\alpha_f^2)\, p}{2 \log p}}{2 \log \frac{1}{\alpha_f}} \right\}
= \frac{\log \frac{(1-\alpha_f^2)\, \min(n,p)}{2 \log \min(n,p)}}{2 \log \frac{1}{\alpha_f}} .
\]
Thus, theorem 1 is obtained. From theorem 1, the kth most recent library pair, (X_{t−k}, Y_{t−k}), can be stored as a fixed point with high probability if
\[
k < \frac{\log \frac{(1-\alpha_f^2)\, \min(n,p)}{2 \log \min(n,p)}}{2 \log \frac{1}{\alpha_f}} .
\]
Figure 1: Typical plot of f(α_f) for α_f ∈ (0, 1) and min(n, p) = 32.
Given p and n, let
\[
f(\alpha_f) = \frac{\log \frac{(1-\alpha_f^2)\min(n,p)}{2\log\min(n,p)}}{2\log\frac{1}{\alpha_f}} . \tag{2.4}
\]
The most interesting point is how to choose the value of α_f ∈ (0, 1) such that f(α_f) is maximal. We denote the value of α_f at which f(α_f) is maximal as α_{f,max}. Clearly, f(α_f) is a continuous function of α_f ∈ (0, 1). Also,
\[
\frac{d f(\alpha_f)}{d \alpha_f}\bigg|_{\alpha_f = 0^+} > 0
\quad\text{and}\quad
\frac{d f(\alpha_f)}{d \alpha_f}\bigg|_{\alpha_f = 1^-} < 0 .
\]
Hence, f(α_f) has at least one local maximum for 0 < α_f < 1. A typical plot of f(α_f) is shown in Figure 1. Using a simple numerical method, we obtain Table 1, which summarizes the values of α_{f,max} at different values of min(n, p).
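As a sketch of such a numerical method (a plain grid search; names are illustrative, and natural logarithms are assumed throughout), the entries of Table 1 below can be reproduced as follows:

```python
import numpy as np

def f(alpha, m):
    # f(alpha_f) from equation 2.4, with m = min(n, p)
    return np.log((1 - alpha**2) * m / (2 * np.log(m))) / (2 * np.log(1 / alpha))

def alpha_f_max(m, num=200_000):
    grid = np.linspace(1e-4, 1 - 1e-7, num)
    vals = f(grid, m)
    i = np.argmax(vals)
    return grid[i], vals[i]

for m in [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]:
    a, fmax = alpha_f_max(m)
    print(f"min(n,p) = {m:5d}   alpha_f,max = {a:.4f}   f(alpha_f,max) = {fmax:.4f}")
```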
Table 1: Summary of α_{f,max} and f(α_{f,max}) Values, Found from Numerical Method.

min(n, p)   α_{f,max}   f(α_{f,max})
16          0.613       0.6012
32          0.742       1.223
64          0.837       2.3453
128         0.902       4.3611
256         0.943       7.9967
512         0.9675      14.599
1024        0.982       26.673
2048        0.9900      48.906
4096        0.9945      90.079
8192        0.9970      166.721
Table 2: Summary of α_{f,max} and f(α_{f,max}) Values, Found from Equation 2.5.

min(n, p)   α_{f,max}   f(α_{f,max})
16          0.2407      0.35
32          0.6412      1.13
64          0.8042      2.29
128         0.8910      4.33
256         0.9393      7.98
512         0.9663      14.59
1024        0.9814      26.67
2048        0.9898      48.91
4096        0.9945      90.08
8192        0.9970      166.72
From Table 1, as min(n, p) increases, α_{f,max} increases. Also, for large min(n, p), α_{f,max} is nearly equal to one. Based on this phenomenon, we can obtain an approximation of α_{f,max}:
\[
\alpha_{f,\max} \approx \sqrt{\,1 - \exp\left\{ -\log\frac{\min(n,p)}{2\log\min(n,p)} + 1 \right\}\,} . \tag{2.5}
\]
Equation 2.5 gives us a guideline for choosing the value of α_f. Table 2 shows α_{f,max} at different values of min(n, p) based on equation 2.5. From the two tables, equation 2.5 is a good approximation of α_{f,max} when min(n, p) is large. Substituting equation 2.5 into theorem 1, we can obtain the following corollary.
Corollary 1. Under the forgetting learning with
\[
\alpha_{f,\max} \approx \sqrt{\,1 - \exp\left\{ -\log\frac{\min(n,p)}{2\log\min(n,p)} + 1 \right\}\,},
\]
if a BAM is trained with t library pairs and
\[
k < \frac{\min(n,p)}{2e\,\log\min(n,p)} , \tag{2.6}
\]
then the probability of the (t − k)th library pair (X_{t−k}, Y_{t−k}) being a fixed point tends to one, as n → ∞, p → ∞.

From corollary 1, the storage ability of the forgetting learning is similar to that of Kosko's encoding method. With Kosko's encoding method, the BAM can encode only up to min(n, p)/(2 log min(n, p)) library pairs; further encoding of new library pairs will damage the entire system. The forgetting learning, on the other hand, can encode any number of library pairs and always keeps the most recent min(n, p)/(2e log min(n, p)) library pairs in the model. Hence, the BAM with the forgetting learning is similar to short-term memory, which always extracts the most recent information from the environment.

3 Simulation

3.1 Optimal Forgetting Constant. We have carried out a simulation to verify equation 2.5 and corollary 1. The dimensions are 512. We randomly generate the library pairs and use the forgetting rule to encode them with different forgetting constants. Figure 2 shows the percentage of runs in which the kth most recent library pair is successfully stored. Note that k = 0 means the current library pair. Suppose that we use 90% as the threshold. From Figure 2, if the value of α_f is too large (such as 0.990), all library pairs (both old and new items) may not be stored as fixed points. When we use a very small value of α_f (such as 0.940), the number of the most recent items being stored as fixed points becomes small. In general, the case of α_f = 0.9663 (based on Table 2) is better than the other cases. Also, there is a sharp drop for k > 14 when α_f = 0.9663. This result is consistent with our theoretical work presented in Tables 1 and 2.

Figure 2: The percentage of the kth most recent library pair being successfully stored. Here, error is not allowed in the retrieved items and n = p = 512. The forgetting rule is equation 1.4, and the forgetting constant is varied.

3.2 Comparison with Learning Within Bounds. Additionally, we have carried out another simulation to compare the performance of equation 1.4 with that of the learning within bounds (Hopfield 1982). Note that for the BAM, learning within bounds can be used only when p = n. To our knowledge, for the BAM the rule for selecting the constant A in the learning within bounds has not been reported for p ≠ n. The dimensions are 512. For equation 1.4, the forgetting constant is based on equation 2.5. For learning within bounds, the constant A is 0.35, which is the optimal value in Hemmen et al. (1988).

Figure 3 shows the percentage of the kth most recent library pair being successfully stored when error is not allowed in the retrieved items. For equation 1.4, the BAM can store the recent 14 library pairs. However, for learning within bounds, the BAM can store only the recent 5 library pairs. Hence, the performance of equation 1.4 is much better than that of learning within bounds when error is not allowed in the retrieved items. Figure 4 shows the percentage of the kth most recent library pair being successfully stored when 5% errors are allowed in the retrieved items. For equation 1.4, the BAM can store the recent 24 library pairs. For learning within bounds, the BAM can store the recent 27 library pairs. Hence, the performance of learning within bounds is only a little better than that of equation 1.4 when a small number of errors are allowed in the retrieved items.
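A minimal sketch of this storage experiment (reusing the forgetting update from the earlier sketch; names and the success criterion are illustrative) might look as follows:

```python
import numpy as np

def storage_profile(n=512, p=512, t=100, alpha_f=0.9663, max_k=30, runs=20):
    """Fraction of runs in which the kth most recent pair is a fixed point
    after t pairs have been encoded with the forgetting rule (equation 1.4)."""
    hits = np.zeros(max_k + 1)
    for _ in range(runs):
        X = np.random.choice([-1, 1], size=(t, n))
        Y = np.random.choice([-1, 1], size=(t, p))
        W = np.zeros((p, n))
        for h in range(t):
            W = alpha_f * W + np.outer(Y[h], X[h])   # forgetting update
        for k in range(max_k + 1):
            x, y = X[t - 1 - k], Y[t - 1 - k]
            # fixed-point test (cf. equation 1.3); note np.sign(0) == 0,
            # which counts as a failure here rather than "state unchanged"
            ok = np.array_equal(np.sign(W @ x), y) and \
                 np.array_equal(np.sign(W.T @ y), x)
            hits[k] += ok
    return hits / runs
```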
Figure 3: The percentage of the kth most recent library pair being successfully stored. Here, error is not allowed in the retrieved items and n = p = 512. Two forgetting rules are used: equation 1.4 and learning within bounds. The forgetting constant α_f in equation 1.4 is based on equation 2.5. The constant A in the learning within bounds is 0.35, which is the optimal value suggested in Hemmen et al. (1988). In this simulation, the performance of equation 1.4 is much better than that of the learning within bounds when error is not allowed in the retrieved items.
Figure 4: The percentage of the kth most recent library pair being successfully stored. Here, 5% errors in the retrieved items are allowed and n = p = 512. Two forgetting rules are used: equation 1.4 and learning within bounds. The forgetting constant α_f in equation 1.4 is based on equation 2.5. The constant A in the learning within bounds is 0.35, which is the optimal value suggested in Hemmen et al. (1988). In this simulation, the performance of the learning within bounds is only a little better than that of equation 1.4 when a small number of errors are allowed in the retrieved items.
Figure 5: The percentage of the kth most recent library pair being successfully stored when equation 1.4 is used. The precision is ±19.999.
3.2.1 Behavior Under a Fixed-Point-Number Machine. The real-valued computation in equation 1.4 creates some difficulties in the implementation. Here, we investigate the behavior of equation 1.4 when the precision of the weights, as well as the precision of the computation, is limited. Three precision schemes are considered. The dimensions are 512. We randomly generate the library pairs and use the forgetting rule under limited precision to encode them.

In the first scheme, the precision is ±19.999 (a 4½-digit machine). Due to the limited precision, α_f is set to 0.966. This means that all computations (the intermediate and final results) in equation 1.4 are rounded to a fixed-point number with three decimal places between ±19.999. The simulation result is shown in Figure 5. From Figure 5, the performance under this precision is quite similar to that under floating-point precision (see Fig. 2 also). The second precision is ±19.99 (a 3½-digit machine), and α_f is set to 0.96. The simulation result is shown in Figure 6. From Figure 6, the BAM can still store 14 recent library pairs when error is not allowed in the retrieved items. When 5% errors are allowed, it can store only about 22 recent library pairs. The third precision is ±9.9 (a 2-digit machine), and α_f is set to 0.9. The simulation result is shown in Figure 7. From Figure 7, the performance of the BAM is greatly reduced, and it can store only about 10 recent library pairs. The main cause of this reduction is the incorrect α_f value: from theorem 1, when α_f is set to 0.9, the BAM can store about 10 recent library pairs.
Figure 6: The percentage of the kth most recent library pair being successfully stored. Here, equation 1.4 is used. The precision is ±19.99.
Figure 7: The percentage of the kth most recent library pair being successfully stored. Here, equation 1.4 is used. The precision is ±9.9.
To sum up, when sufficient precision is used (for n = p = 512, about 3 digits), corollary 1 still holds. When the precision limits the value of α_f, we can use theorem 1 to predict the performance.
4 Magnitude of the Weights

4.1 The Worst Case Bound. Let w_{t,ji} be the weight between the jth neuron in F_Y and the ith neuron in F_X. After t library pairs have been encoded using equation 1.4,
\[
w_{t,ji} = \sum_{h=1}^{t} \alpha_f^{t-h}\, x_{h,i}\, y_{h,j} . \tag{4.1}
\]
In the worst case, x_{h,i} = y_{h,j} for all h (or x_{h,i} = −y_{h,j} for all h),
\[
|w_{t,ji}| = \sum_{h=1}^{t} \alpha_f^{t-h} \le \frac{1}{1-\alpha_f} . \tag{4.2}
\]
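As a quick numerical sanity check of this geometric-series bound (an illustrative sketch; the parameter values are arbitrary):

```python
alpha_f = 0.9663
t = 2000
# worst case: x_{h,i} = y_{h,j} for all h, so every term is +alpha_f**(t-h)
w_worst = sum(alpha_f ** (t - h) for h in range(1, t + 1))
print(w_worst, "<=", 1 / (1 - alpha_f))   # bound of equation 4.2
```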
Clearly, when α_f is a constant, the magnitude of w_{t,ji} is upper bounded by O(1). If α_f is selected based on equation 2.5, then
\[
|w_{t,ji}| \le \frac{1}{1 - \sqrt{1 - \exp\left\{-\log\frac{\min(n,p)}{2\log\min(n,p)} + 1\right\}}} \tag{4.3}
\]
\[
= \frac{1}{\frac{1}{2}\,\frac{2e\log\min(n,p)}{\min(n,p)} + \frac{1}{8}\left(\frac{2e\log\min(n,p)}{\min(n,p)}\right)^2 + \text{higher-order terms}} . \tag{4.4}
\]
Hence, for large n and p (note that p = rn),
\[
|w_{t,ji}| \le O\!\left(\frac{n}{\log n}\right). \tag{4.5}
\]
4.2 The Probabilistic Bound. If x_{h,i} and y_{h,j} are ±1 equiprobable independent random variables, w_{t,ji} (based on equation 1.4) can be expressed as
\[
w_{t,ji} = \sum_{h=1}^{t} \alpha_f^{t-h}\, z_h , \tag{4.6}
\]
where the z_h's are ±1 equiprobable independent random variables. Clearly, the distribution of w_{t,ji} is symmetric. Let η be a standard normal random variable with variance one. It is not difficult to show that E(exp{τ z_h}) < E(exp{τ η}). With lemma 1,
\[
\operatorname{Prob}\left\{ |w_{t,ji}| > A\sqrt{\min(n,p)} \right\}
\le 2 \inf_{\tau \ge 0} \exp\left\{-\tau A\sqrt{\min(n,p)}\right\} \exp\left\{ \frac{\sum_{h=1}^{t} \alpha_f^{2t-2h}\, \tau^2}{2} \right\}
\le 2 \inf_{\tau \ge 0} \exp\left\{-\tau A\sqrt{\min(n,p)}\right\} \exp\left\{ \frac{\tau^2}{2(1-\alpha_f^2)} \right\},
\]
where A is a positive constant. The right-hand side is minimized at
\[
\tau = A\sqrt{\min(n,p)}\,(1-\alpha_f^2) .
\]
With this value of τ,
\[
\operatorname{Prob}\left\{ |w_{t,ji}| > A\sqrt{\min(n,p)} \right\} \le 2 \exp\left\{ -\frac{(1-\alpha_f^2)\, A^2 \min(n,p)}{2} \right\}. \tag{4.7}
\]
When α_f is a constant, the probability Prob{|w_{t,ji}| > A√min(n, p)} tends to zero exponentially as n → ∞. If α_f is selected based on equation 2.5,
\[
\operatorname{Prob}\left\{ |w_{t,ji}| > A\sqrt{\min(n,p)} \right\} \le 2 \exp\left\{ -e A^2 \log\min(n,p) \right\}. \tag{4.8}
\]
It also tends to zero as n → ∞. To sum up, the magnitude of the weights is upper bounded by O(√min(n, p)) = O(√n) in the probabilistic sense when equation 1.4 is used.

5 Conclusion

We have examined the statistical storage behavior of the BAM under the forgetting learning. Also, we have derived the formula for choosing the forgetting constant:
\[
\alpha_{f,\max} \approx \sqrt{\,1 - \exp\left\{ -\log\frac{\min(n,p)}{2\log\min(n,p)} + 1 \right\}\,} .
\]
With this value, the forgetting learning rule can encode any number of library pairs and always keeps the most recent min(n, p)/(2e log min(n, p)) library pairs in the BAM. Computer simulations have been carried out to verify our theoretical work. The magnitude of the weights was also discussed.
The results presented here can be extended to a Hopfield network. By adopting the approach we set out, we can easily obtain the corresponding result for a Hopfield network by replacing min(n, p) with n in all of the equations in this article.

Acknowledgments

We are very thankful to the reviewers for their valuable suggestions for improving this article.

References

Forrest, B. M., and Wallace, D. J. 1991. Storage capacity and learning in Ising-spin neural networks. In Models of Neural Networks, E. Domany, J. L. van Hemmen, and K. Schulten, eds. Springer-Verlag, Berlin.
Haines, K., and Nielsen, R. H. 1988. A BAM with increased information storage capacity. Proc. of the 1988 IEEE Int. Conf. on Neural Networks, 181–190.
Hemmen, J. L., Keller, G., and Kuhn, R. 1988. Forgetful memories. Europhys. Lett. 5, 663–668.
Hemmen, J. L., and Kuhn, R. 1991. Collective phenomena in neural networks. In Models of Neural Networks. Springer-Verlag, Berlin.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. 79, 2554–2558.
Kohonen, T. 1972. Correlation matrix memories. IEEE Trans. Comput. 21, 353–359.
Kosko, B. 1988. Bidirectional associative memories. IEEE Trans. on Syst., Man, and Cybern. 18, 49–60.
Kosko, B. 1992. Neural Networks and Fuzzy Systems. Prentice Hall, Englewood Cliffs, NJ.
Leung, C. S. 1993. Encoding method for bidirectional associative memory using projection on convex sets. IEEE Trans. on Neural Networks 4, 879–881.
Leung, C. S., Chan, L. W., and Lai, E. 1995. Stability, capacity and statistical dynamics of second order bidirectional associative memory. IEEE Trans. Syst., Man, and Cybern. 25, 1414–1424.
Nadal, J. P., Toulouse, G., Changeux, J. P., and Dehaene, S. 1986. Networks of formal neurons and memory palimpsests. Europhys. Lett. 1, 535–542.
Palm, G. 1980. On associative memory. Biolog. Cybern. 36, 19–31.
Parisi, G. 1986. A memory which forgets. J. Phys. A: Math. Gen. 19, L617–L620.
Wang, Y. F., Cruz, J. B., and Mulligan, J. H. 1990. Two coding strategies for bidirectional associative memory. IEEE Trans. on Neural Networks 1, 81–92.
Received May 30, 1995; accepted April 25, 1996.
Communicated by W. Thomas Miller
Adaptive Encoding Strongly Improves Function Approximation with CMAC

Martin Eldracher, Alexander Staller, René Pompl
The Boston Consulting Group, Eine Gesellschaft Buergerlichen Rechts, Sendlinger Strasse 7, 80331 Muenchen, Germany
The Cerebellar Model Arithmetic Computer (CMAC) (Albus 1981) is well known as a good function approximator with local generalization abilities. Depending on the smoothness of the function to be approximated, the resolution, as the smallest distinguishable part of the input domain, plays a crucial role. If the binary quantizing functions in CMAC are dropped in favor of more general, continuous-valued functions, much better results in function approximation for smooth functions are obtained in shorter training time with less memory consumption. For functions with discontinuities, we obtain a further improvement by adapting the continuous encoding proposed in Eldracher and Geiger (1994) for difficult-to-approximate areas. Based on the already far better function approximation capability on continuous functions with a fixed topologically distributed encoding scheme in CMAC (Eldracher et al. 1994), we present the better results in learning a two-valued function with discontinuity using this adaptive topologically distributed encoding scheme in CMAC.

1 Introduction

Hard-limiting coarse coding elements in CMAC (Albus 1981) cause artificial discontinuities in the stored function and its partial derivatives. To avoid this, An et al. (1991) propose so-called superspheres instead of binary units but do not present any results. Triangle-shaped activation functions (Nie and Linkens 1993) leave no discontinuities except undefined partial derivatives. For moderate memory consumption and quick initial learning with large receptive fields, while keeping a reasonable resolution (i.e., a reasonable size of memory), Mischo (1992) scales memory access and learning with a function of the distance of the input vector to the receptive field of each weight. Using a gaussian function, this in principle results in gaussian-shaped receptive fields (other functions are possible). Eldracher et al. (1994) proposed a direct implementation of gaussian-shaped receptive fields, using a one-dimensional topologically distributed encoding (TDE) in CMAC
(Eldracher and Geiger 1994). Compared to binary CMAC, this yields much better results on smooth functions (Eldracher et al. 1994). However, there remain problems with discontinuities: the larger the derivatives of the function to approximate, the larger the difficulties (given a fixed number of receptive fields). These difficulties could be remedied by introducing more encoding units, but computation time and memory consumption would increase, too. Therefore we keep the same number of encoding units but adapt their positions in order to concentrate units in difficult areas, that is, areas with still higher-than-average output errors.

2 General Model of CMAC

The following description of CMAC is a generalization of Albus (1981) and Tolle and Ersü (1992). Details can be found in Appendix A. A quantizing interval in Albus (1981) can be seen as a binary-valued activation function act_{ij} of a mossy fiber j in input dimension i. A mossy fiber is positioned at µ_{ij}, the middle of its quantizing interval. The distance Δµ between the positions µ_{ij} and µ_{i,j+1} of two adjacent mossy fibers defines CMAC's resolution ε. In binary CMAC, ρ mossy fibers are always activated for each input sample.

Each dimension of the input should be encoded in a topologically distributed manner. Therefore each binary activation function is replaced by a normalized, gaussian-shaped activation function with mean µ_{ij}. The variance σ_{ij} defines the overlap between neighboring mossy fibers. For a local generalization, the activation function act_{ij} is defined to be zero if it is smaller than a constant a_min. In order to get the highest possible encoding accuracy, one must choose σ = 2^k ε/√2 with k = 2, 3, ... (Kinder and Brauer 1993). This is due to the fact that the encoding of a number by a specific active mossy fiber is better the steeper the flank of the gaussian (i.e., the larger the difference in the mossy fiber's activation for a certain small difference in the input). With the cited σ, the gaussians of neighboring mossy fibers intersect so that some mossy fiber is always maximally steep (i.e., close to its maximum gradient). However, preliminary experiments showed that it is not necessary to choose this theoretically optimal σ, since the coding accuracy is by far sufficient, and basically σ only determines the overlap of the receptive fields of neighboring mossy fibers (i.e., the number of simultaneously active granule cells, or the generalization). Now ρ serves only to determine the number and positions of mossy fibers, given the encoding interval [0, s_max[ and resolution ε.

Taking into account the modified mossy fibers, a granule cell v ∈ A is defined as in the original model. To generate a granule cell v, mossy fibers with indices congruent modulo ρ are combined such that each v is connected to exactly one mossy fiber from each input dimension. v is named uniquely by the tuple of the numbers j_i of the mossy fibers connected to v.
The position µ_v of a granule cell v = (j_1, ..., j_n) ∈ A is defined as the n-tuple of the positions of its mossy fibers: µ_v := (µ_{1j_1}, ..., µ_{nj_n}). The activation function a_v of a granule cell v is the product of its mossy fibers' activation functions, an n-dimensional gaussian. Each granule cell v ∈ A is provided with a weight w_v ∈ R. Using only trained weights, the output p is computed as the mean of the products of the granule cells' activities and their weights. For learning we generalize a method from Tolle and Ersü (1992): given the target output p*, untrained weights are first set to p*; then the remaining error is evenly distributed over all weights, which are trained using the delta rule with learning rate η (0 < η ≤ 1).
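The forward pass just described is compact enough to sketch directly. The following minimal numpy sketch (names, shapes, and the exact normalization are illustrative assumptions, not the authors' implementation) computes granule cell activations and the output for a CMAC with gaussian receptive fields:

```python
import numpy as np

def granule_activations(s, mu, sigma, a_min):
    """Activations of all granule cells for input s.

    s     : input vector, shape (n,)
    mu    : granule cell positions, shape (num_cells, n)
    sigma : receptive field widths, shape (num_cells, n)
    a_min : activation cutoff for local generalization
    """
    # product of per-dimension gaussians = n-dimensional gaussian
    act = np.prod(np.exp(-((s - mu) ** 2) / (2 * sigma ** 2)), axis=1)
    act[act < a_min] = 0.0              # enforce local generalization
    return act

def cmac_output(s, mu, sigma, a_min, w):
    """Output as the mean of weighted activities of the active granule cells."""
    act = granule_activations(s, mu, sigma, a_min)
    active = act > 0
    if not active.any():
        return 0.0
    return np.mean(act[active] * w[active])
```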
3 Adaptive Topologically Distributed Encoding

The aim of adaptive topologically distributed encoding (ATDE) is to use fewer encoding units (i.e., mossy fibers for ATDE in CMAC) in order to use less memory and get quicker architectures with better classification ratios (Eldracher and Geiger 1994). This is achieved by moving mossy fibers into those parts of their input dimension where they are really needed, instead of distributing them uniformly. As the positions µ_{ij} of the mossy fibers change during the adaptation process, σ and ε = Δµ are no longer equal for all encoding units. As the left and right neighbors of a specific mossy fiber may have different distances Δµ, two different σ are used for each mossy fiber. After moving a mossy fiber, all σ_{ij} are adapted in order to keep the highest possible encoding accuracy (see Section 2).

Whenever an input pattern is presented to the network, the center of the most activated encoding unit bmu (best matching unit) in each input dimension is moved a bit closer to this data point. The topological neighbors are moved in the direction of the input sample, too. Afterward, the widths σ_{ij} have to be adapted in all input dimensions. The adaptation of the positions µ_{ij} of the mossy fibers is based on the distance of the training pattern to µ_{ij} and on the output error. This implies that encoding units concentrate around patterns that caused high output errors during training. This yields smaller distances between the mossy fibers in these areas and therefore steeper flanks of the corresponding gaussians, as the σ depend on Δµ. As a result, a higher accuracy can be achieved in these areas, as well as the distinction of input patterns with totally different output values (e.g., at discontinuities).

The displacement of the bmu is scaled with a fixed learning rate η_bmu. For the neighbors, we scale η_bmu with a function h(i), depending on the topological order over the mossy fibers. A mossy fiber has the number i if it is the ith mossy fiber on this input dimension with respect to the center's µ. If the best matching unit is numbered bmu, we define the (discrete)
neighborhood function h(i) as
\[
h(i) = \frac{1}{|i - bmu| + 1} .
\]
Note that 0 < h(i) ≤ 1 (in particular, h(bmu) = 1), and h(i) is symmetrically decreasing around bmu. h(i) does not depend on the actual distance between the center of a mossy fiber and the training input. In order to achieve convergence in the mossy fiber placement, all learning rates are diminished exponentially in time with cooling factor δ (0 < δ < 1). Starting with δ⁰ = 1 at epoch 1 (one training epoch equals a time step), the mossy fibers' movement decreases exponentially every epoch by multiplication with δ^(epoch−1). Furthermore, we have to scale the mossy fibers' movement in order to prevent single mossy fibers from overtaking each other, which would destroy the topological encoding. We do this by dividing the size of the movement by the maximum output error of CMAC, the difference of the smallest and largest output value, here s_max = 1 (see Section 2 and Appendix A). Altogether, for input s the positions µ of all mossy fibers falling in the N-neighborhood of the best matching unit change according to
\[
\mu^{(\text{new})} := \mu + (s - \mu)\cdot |p - p^*| \cdot h(i) \cdot \delta^{(\text{epoch}-1)} / s_{\max} . \tag{3.1}
\]
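A minimal sketch of this position update for one input dimension (assuming 1-based epochs and the h(i) defined above; names are illustrative):

```python
import numpy as np

def atde_update(mu, s, out_err, bmu, epoch, N=40, delta=0.5, s_max=1.0):
    """Move the best matching unit and its N topological neighbors toward
    input s, scaled by output error and cooling (cf. equation 3.1).

    mu      : sorted mossy fiber positions in one input dimension, shape (M,)
    s       : scalar input component
    out_err : |p - p*| for the current pattern
    bmu     : index of the most activated mossy fiber
    """
    mu = mu.copy()
    cooling = delta ** (epoch - 1)
    lo, hi = max(0, bmu - N), min(len(mu) - 1, bmu + N)
    for i in range(lo, hi + 1):
        h = 1.0 / (abs(i - bmu) + 1)              # neighborhood function h(i)
        mu[i] += (s - mu[i]) * out_err * h * cooling / s_max
    return mu
```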
The learning of the weights remains as in standard CMAC, based on the activations before the positions of the mossy fibers are changed (because the output errors are computed for this mossy fiber distribution). This implies that there will be some artificial approximation errors after changing the mossy fibers' positions. With small learning constants for the mossy fibers' movement, this should not cause any negative effect. In particular, as the changes in the positions of the mossy fibers get smaller and smaller under the cooling factor (see, e.g., Ritter et al. 1991), after some time the mossy fiber distribution remains constant. From this point on, there is no more interplay between weight change and mossy fiber distribution change. Therefore the choice of the cooling factor is not critical: a cooling that is too rapid will result in a mossy fiber distribution close to the initial distribution; a cooling that is too slow might result in an elongated learning process but will never prevent learning.

4 Results

Since we know that there remain some difficulties with TDE in CMAC in approximating discontinuous functions (Eldracher, Staller, and Pompl 1994), we use a discontinuous function EDGE (Tolle and Ersü 1992) to show the capabilities of ATDE in CMAC:
\[
\text{EDGE} : [0,1]^2 \to [0,1], \qquad
\text{EDGE}(s_1, s_2) = \begin{cases} 0.75\, s_1 & \text{if } s_1 < 0.6 \wedge s_2 < 0.5 \\ 1 & \text{else.} \end{cases}
\]
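For reference, the target function is trivial to state in code (a direct transcription of the definition above):

```python
def edge(s1, s2):
    """The EDGE test function on [0, 1]^2 (Tolle and Ersü 1992)."""
    return 0.75 * s1 if (s1 < 0.6 and s2 < 0.5) else 1.0
```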
While EDGE is very easy to approximate with a binary CMAC since it is nearly everywhere constant or linear, EDGE is an example of the worst case
Adaptive Encoding Strongly Improves Function Approximation
407
Table 1: Approximation of the Function EDGE with Binary Activation Functions. %ε
%
2−2 2−3 2 − 4* 2−5
16 16 16 16
ε
2−6 2−7 2−8 2−9
η
0.1 0.5 1.0 1.0
1089 Test Patterns After First Epoch After Tenth Epoch Emean Esq Emax Emean Esq Emax 10.9% 14.2% 43.6% 5.8% 9.9% 51.6% 2.6% 8.6% 68.4% 6.9% 24.2% 100.0%
5.6% 8.9% 43.2% 4.2% 7.6% 48.9% 2.5% 8.0% 71.8% 6.9% 24.2% 100.0%
*The parameters in this row were used in Figure 1.
concerning approximation with TDE. To produce a constant output value using nonlinear, gaussian-shaped mossy fiber activation functions, different weights have to be chosen for each active granule cell depending on the relative position of a granule cell to the training pattern. This is different for a constant activation function as in original CMAC (Albus 1981), where the relative position does not have any influence, as well as for linear triangle-shaped activation functions (Nie and Linkens 1993), where in each dimension the loss of activation in one mossy fiber is exactly compensated by an increase in the neighboring mossy fiber’s activation. Consequently with TDE in CMAC, a weight cannot fit when test patterns different from the training patterns are used. Furthermore, an edge is difficult to model with relatively large, fixed-sized receptive fields. Both effects could be remedied. For the former problem, a higher number of training examples would help to adapt all weights better. For the latter problem, more encoding units could be used. We do not try to do this here, since we want to compare our approaches with exactly the same conditions. The training set consists of 1000 random samples. The test set contains 1089 samples on a regular mesh. For quality measurement, mean absolute error, mean squared error, and maximum absolute error are used. 4.1 Binary Encoding. For the binary encoding we need a smaller resolution ε to get good results. Since we want to use similar sizes %ε for the mossy fibers (and granule cells) in all experiments, we have to use a larger % = 16. However, this yields CMAC storages that use about four times more memory than comparable (A)TDE CMACs. Results with binary CMACs are shown in Table 1 and Figure 1. 4.2 Topologically Distributed Encoding. In order to get a coding-opti7 9 mal σ , we set all σ = 2− 2 , 2− 2 (two local optima of the encoding accuracy, yielding different-sized receptive fields; see Section 2). With values amin =
408
Martin Eldracher, Alexander Staller, and Ren´e Pompl
Figure 1: Binary CMAC: Approximation of EDGE (left) and remaining absolute error (right) for % = 16, ε = 2^-8, and η = 1.0.

Table 2: Approximation of EDGE with TDE in CMAC with η = 0.5.

                                                      1089 Test Patterns
                                              After First Epoch        After Tenth Epoch
  σ        %ε     %   ε      amin    |A*|    Emean   Esq     Emax     Emean   Esq     Emax
  2^-7/2   2^-4   4   2^-6   e^-1      47    5.9%    11.2%   52.7%    5.3%    10.0%   56.9%
                             e^-2      87    6.5%    11.8%   57.2%    6.1%    10.9%   55.8%
                             e^-4     168    6.3%    11.6%   56.4%    5.6%    10.5%   57.3%
           2^-5   4   2^-7   e^-1     184    5.9%    11.2%   53.1%    5.1%     9.8%   58.6%
                             e^-2     339    6.5%    11.8%   57.5%    6.1%    10.9%   56.5%
                             e^-4     651    6.3%    11.6%   56.2%    5.6%    10.5%   56.9%
  2^-9/2   2^-4   4   2^-6   e^-1      12    5.6%     9.8%   55.5%    3.8%     8.8%   60.4%
                             e^-2      23    5.0%    10.0%   55.2%    3.8%     9.1%   56.7%
                             e^-4*     47    4.9%    10.2%   58.3%    3.7%     9.1%   61.2%
           2^-5   4   2^-7   e^-1      48    5.1%     9.4%   53.1%    3.5%     8.5%   59.4%
                             e^-2      90    4.8%     9.9%   54.3%    3.8%     9.1%   55.9%
                             e^-4     184    4.9%    10.1%   55.7%    3.6%     9.1%   60.9%

*The parameters in this row were used in Figure 2.
With values amin = e^-4, e^-2, and e^-1, the circular receptive fields around µv have radii of 2√2·σ, 2σ, and √2·σ, respectively. Results with learning rate η = 0.5 are given in Table 2, which also gives the average number |A*| of active granule cells (for binary activation functions there would be |A*| = % = 4). Figure 2 shows an approximation result.
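The radii quoted above follow directly from the gaussian cutoff amin (cf. equation A.1 in the appendix). A short sketch of the computation, with σ normalized to 1:

```python
import math

def receptive_field_radius(sigma, a_min):
    """Radius at which the gaussian activation falls to the cutoff a_min:
    solve exp(-r**2 / (2 * sigma**2)) = a_min for r."""
    return sigma * math.sqrt(2.0 * math.log(1.0 / a_min))

for a_min in (math.exp(-4), math.exp(-2), math.exp(-1)):
    print(receptive_field_radius(1.0, a_min))
# prints 2.828..., 2.0, and 1.414..., i.e., 2*sqrt(2)*sigma, 2*sigma, sqrt(2)*sigma
```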
Figure 2: CMAC with TDE: Approximation of EDGE (left) and remaining absolute error (right) for σ = 2^-9/2, % = 4, ε = 2^-6, amin = e^-4, and η = 0.5.
4.3 Adaptive Topologically Distributed Encoding. We fix % = 4 for an initial Δµinit = εinit = 2^-6, 2^-7. In order to get an optimal initial σinit, we set σinit = 2^-7/2, 2^-9/2. Furthermore, we choose amin = e^-4, e^-2, e^-1 and δ = 0.5. Because there might be a different optimal number N of neighbors to be adapted together with the best matching unit, some preliminary tests were conducted. N = 40 gives the best results (the total number of mossy fibers for each dimension is 68). It seems sound to use a large N to approximate a discontinuous function, since many mossy fibers have to be concentrated near the discontinuity. A large number of adapted neighbors ensures a more global change of the overall distribution of the mossy fibers, making it easier to get mossy fibers really dense at certain positions. It is therefore not surprising that after adaptation and learning, mean and squared errors are lower for the adaptive continuous activation functions (compare Tables 2 and 3 and Figs. 2 and 3). However, the offline investigation of N was done only for academic reasons; it showed that N can be varied within large ranges without harmful results. If a small N is chosen, the adaptation process is merely slower, so slowing the cooling, and thereby prolonging the mossy fibers' position changes, allows the same results to be reached with a smaller N.

Figure 4 shows the final distribution of the mossy fibers for fixed (left) and adaptive (right) topologically distributed encoding after learning. Each line is drawn from the position of a mossy fiber. Positions of the granule cells lie on the intersection points of lines equal modulo % (cf. Section 2). Because the encoding does not model the cross-correlations between the coordinates directly, the concentration of granule cells also rises in areas where this is not needed to approximate the function. But this small drawback is
Table 3: Approximation of EDGE with ATDE in CMAC, η = 0.5, N = 40, and δ = 0.5.

                                                           1089 Test Patterns
                                                   After First Epoch        After Tenth Epoch
  σinit    %εinit   %   εinit   amin    |A*|    Emean   Esq     Emax     Emean   Esq     Emax
  2^-7/2   2^-4     4   2^-6    e^-1      24    3.2%     7.6%   44.6%    2.7%     7.1%   61.3%
                                e^-2      43    4.4%     9.1%   43.4%    3.5%     8.2%   54.3%
                                e^-4      79    4.8%     9.4%   46.3%    3.5%     8.2%   55.1%
           2^-5     4   2^-7    e^-1     138    4.9%    10.1%   62.6%    3.9%     9.4%   58.7%
                                e^-2     264    5.7%    10.9%   64.9%    4.5%     9.8%   64.5%
                                e^-4     498    5.6%    10.5%   61.1%    4.5%     9.5%   60.9%
  2^-9/2   2^-4     4   2^-6    e^-1       8    6.8%    13.8%   81.7%    2.1%     6.2%   64.4%
                                e^-2      13    4.1%     8.5%   50.4%    1.8%     6.2%   68.7%
                                e^-4*     23    3.2%     7.4%   53.0%    1.9%     6.5%   73.7%
           2^-5     4   2^-7    e^-1      34    5.7%    10.6%   58.2%    2.7%     7.4%   64.8%
                                e^-2      69    4.2%     8.4%   47.7%    2.6%     7.8%   59.4%
                                e^-4     139    4.2%     8.5%   43.6%    2.6%     7.7%   57.9%

*The parameters in this row were used in Figure 3.
Figure 3: CMAC with ATDE: Approximation of EDGE (left) and remaining absolute error (right) for σinit = 2^-9/2, % = 4, εinit = 2^-6, amin = e^-4, η = 0.5, N = 40, and δ = 0.5.
rewarded with very quick algorithms based purely on CMAC. If in some application the training patterns do not cover the whole input space (e.g., when the state space of a process is a diagonal trajectory), this drawback is reduced by implementing hash coding for the granule cells (Tolle and Ersü 1992). Hash-coding memory is allocated only for granule cells that cover areas
Figure 4: Granule cells are positioned at intersection points for fixed (left) and adaptive (right) topologically distributed encoding after training.
with training data from the process's state space. Therefore, even for high-dimensional state spaces, often only some memory for a (possibly very small) hash table is consumed. In the case of a diagonal state space, the adaptive coding scheme basically results in different distances between the granule cells along this diagonal. Similar considerations hold for other state space trajectories. Due to hash coding, the addressing of the granule cells is exactly the same as in the original CMAC. However, possible hardware implementations may become much less efficient when hash coding is used.

5 Discussion

CMAC is a function approximator: no data point is modeled exactly, as would be necessary for function interpolation. Rather, CMAC tries to minimize the overall error over all data, that is, to approximate the function. Even a learning rate of η = 1 will not yield a function interpolation; due to the generalization over close training points, the exact match obtained after training a particular input pattern is altered again by ongoing training. However, CMAC is a local function approximator, not only because the generalization is only local but also because there is no extrapolation beyond the learned granule cells, which have a certain limited area of influence.

Gaussian-shaped, n-dimensional receptive fields for the granule cells are a natural result of one-dimensional gaussian activation functions for the mossy fibers of the single input dimensions. Using TDE (see Appendix B) in CMAC therefore yields well-defined partial derivatives for the approximated function. After ten epochs of training with TDE in CMAC, the errors for smooth, continuous functions such as two-dimensional sine or cosine functions can be reduced to less than 30% of the smallest error with binary CMAC without any parameter fine tuning, while even after the first epoch,
the same result as for the best binary CMAC is reached (Eldracher et al. 1994). Also, the parameter-dependent variations of the error after training (i.e., the variations in the quality of the function approximation for different parameter settings) are much smaller than for binary encoding in CMAC (Eldracher et al. 1994). Furthermore, with TDE in CMAC we can use a smaller % for better results, which yields a linearly smaller number of mossy fibers in each input dimension. (Remember that % determines only the number and positions of the mossy fibers (see Section 2), while the generalization is now determined by the parameter σ. Keeping the size %ε of the granule cells constant while decreasing % demands an increased resolution ε, or generally larger distances Δµ between mossy fibers, thus decreasing the number of mossy fibers.) Considering the exponential growth of the number of granule cells with an increasing number of input dimensions, this effect should not be underestimated. Even considering that for each mossy fiber we now have to store new parameters for the centers µ and variances σ, we save memory, since this linear increase in memory consumption is set against the exponential decrease in the number of weights for the granule cells made possible by the changes in % and ε. The memory savings therefore become even better for (A)TDE CMAC when approximating higher-dimensional functions.

ATDE's direct moving of the positions µij of the mossy fibers is much more straightforward than moving the granule cells, as proposed in Hormel (1989). But in contrast to Hormel (1989), ATDE does not directly model the cross-correlations between the coordinates and thus may concentrate granule cells in areas where this is not needed to approximate the function. This small drawback is rewarded with very quick, purely CMAC-based algorithms. Furthermore, superfluous cells are often omitted automatically when hash coding is used for the granule cells within the standard CMAC concept (Tolle and Ersü 1992). Unlike Hormel (1989), ATDE adapts not only to the input patterns' distribution but also to the remaining output error. Furthermore, inserting more encoding units in ATDE following Fritzke (1993) is easily possible (Eldracher and Geiger 1994) if a better approximation is needed. Results concerning the insertion of mossy fibers will be reported elsewhere (but are available in an internal report written in German).

Compared to binary encoding or fixed TDE, ATDE in CMAC yields the lowest errors while preserving the advantages of fixed TDE: less computing power and memory consumed, and low parameter-dependent variations. Our approach is very quick. Ten epochs with 1000 training samples take only around half a minute (on an HP Series 9000 Model 700 in multiuser operation), and good results are obtained after just the first epoch. Although in our implementation binary CMAC is about three times quicker per epoch than ATDE CMAC (2.7 times quicker than TDE CMAC), the overall speed is better for ATDE CMAC because far fewer epochs are needed. With our absolute computation times, we are much quicker than many other adaptive coding schemes (e.g., Ritter et al. 1991; Fritzke 1993). If even faster algorithms are necessary in real-time applications, we must keep in mind that with a quick hardware
implementation (with easy addressing but without hash coding), ATDE CMAC possibly wastes a lot of memory for sparsely populated state spaces, because it cannot model the cross-correlations of the single input dimensions. However, a binary CMAC would do the same or even worse, due to its higher number of granule cells. The results show that the difficulty of modeling constant functions with TDE (see Section 4) is only a theoretical drawback: using ATDE, even constant functions are modeled better than with binary CMAC. Besides, the increase in the approximation error from binary CMAC to a CMAC with TDE is only marginal, whereas smoother functions are modeled much better with TDE in CMAC than with a binary CMAC (for detailed examples see Eldracher et al. 1994). The parameter N for the number of neighboring mossy fibers adapted simultaneously is not critical. However, if it is known that a discontinuous function is to be approximated, it is better to choose larger values for N.

As an application in control, we are currently investigating pole balancing. This standard benchmark problem is especially interesting, since a nearly diagonal state space must be approximated. Starting from the ASE/ACE model (Barto et al. 1983), we changed the linear perceptrons of ASE and ACE into CMACs and also got better results using less memory (Lin and Kim 1991). Moreover, using TDE in CMAC speeds up learning convergence: although each trial is about three times slower, we get an absolute speed-up of about three, since we need about ten times fewer trials until the pole is balanced in each epoch. For our first experiments, we used a CMAC with TDE that consumed more memory than the binary CMAC, since values for µ and σ had to be stored for each mossy fiber. Using ATDE CMAC, we can get similar results with further reduced memory consumption (due to a reduced % with equal %ε). However, these experiments are not yet finished; a thorough examination of the expected benefits of ATDE in CMAC for control applications will be reported in detail elsewhere.

Appendix A: Detailed General Model of CMAC

CMAC can be defined as a number of mappings from the input s ∈ S to the mossy fibers m ∈ M, further to the granule cells a ∈ A, and finally to the output r ∈ R (Albus 1981). To simplify the description, we assume each si to be normalized to si ∈ [0, smax[ for all i ∈ In := {1, 2, . . . , n}, and we assume a one-dimensional output space. For higher-dimensional output functions, the same mappings are used, apart from A → R, which is different for each component of the output vector.

A.1 Mapping Input to Output: S → M → A → R. Figure 5 shows the mappings described in the following subsections.

A.1.1 Mapping S → M. Each component si of the n-dimensional input vector s ∈ S is encoded separately by the set Mi = {1, 2, . . . , nM} of mossy fibers.
Figure 5: Example of the encoding in a binary-valued, original CMAC for a two-dimensional input space, % = 4, ε = 0.25, and encoding values in [0, 5.25[².
Each mossy fiber j ∈ Mi is defined through its position µij and its activation function actij. To define µij, the interval [0, smax[ is divided into subintervals of length ε, the resolution of CMAC. For each mossy fiber j ∈ Mi an interval Iij is defined, Iij := [jε − %ε, jε[, where % ∈ {1, 2, . . .}. The positions are defined as the centers of these intervals: µij := jε − %ε/2. As each number in [0, smax[ is contained in exactly % consecutive intervals, % specifies the amount of generalization. We use a normalized gaussian actij with mean µij for each mossy fiber j ∈ Mi. The variance σij defines the overlap between neighboring mossy fibers. For local generalization, actij is defined to be zero wherever it would be smaller than a constant amin:

$$
\mathrm{act}_{ij}(s_i) :=
\begin{cases}
\mathrm{gauss}_{ij}(s_i) & \text{if } \mathrm{gauss}_{ij}(s_i) \ge a_{\min} \\
0 & \text{else}
\end{cases}
\qquad \text{with} \qquad
\mathrm{gauss}_{ij}(s_i) = \exp\!\left(-\frac{1}{2}\left(\frac{s_i - \mu_{ij}}{\sigma_{ij}}\right)^{2}\right).
\tag{A.1}
$$
A mossy fiber is active if its activation is different from zero. Each input coordinate si is encoded by the set of active mossy fibers M*i := {j ∈ Mi | actij(si) > 0}. For every input component si, the activations of all mossy fibers j ∈ Mi can be collected in a row vector of nM elements. Written one beneath the other, these vectors form an n × nM matrix M. If M denotes the set of real n × nM matrices, S → M can be written as

$$
\vec{s} \mapsto M = (m_{ij})_{i \in I_n,\; j \in \{1,\dots,n_M\}}
\qquad \text{with} \qquad
m_{ij} = \mathrm{act}_{ij}(s_i).
\tag{A.2}
$$
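To make the S → M mapping concrete, here is a minimal Python sketch of equations A.1 and A.2. The function and parameter names simply mirror the symbols in the text; any particular choice of %, ε, σ, and amin is up to the experiment.

```python
import math

def act(s_i, mu_ij, sigma_ij, a_min):
    """Equation A.1: gaussian activation around the position mu_ij,
    cut off to zero below a_min so that generalization stays local."""
    g = math.exp(-0.5 * ((s_i - mu_ij) / sigma_ij) ** 2)
    return g if g >= a_min else 0.0

def encode(s, n_m, eps, rho, sigma, a_min):
    """Equation A.2: map an n-dimensional input s to the n x n_M matrix M
    of mossy fiber activations, with positions mu_ij = j*eps - rho*eps/2."""
    return [[act(s_i, j * eps - rho * eps / 2.0, sigma, a_min)
             for j in range(1, n_m + 1)]
            for s_i in s]
```

A granule cell activation (Section A.1.2) is then simply the product of one entry per row of M, taken over mossy fibers congruent modulo %.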
A.1.2 Mapping M → A. Each granule cell v ∈ A is defined by its position µv and its activation function av, and it is named uniquely by the tuple of the numbers ji of the mossy fibers connected to it in each dimension i. To generate a granule cell v, mossy fibers congruent modulo % are combined, so that v is connected to exactly one mossy fiber from each input dimension. It holds that A ⊆ M1 × · · · × Mn. The position µv of a granule cell v = (j1, . . . , jn) ∈ A is defined as the n-tuple of the positions of the mossy fibers connected to v: µv := (µ1j1, . . . , µnjn). The activation function av of a granule cell v is the product of the mossy fiber activation functions, an n-dimensional gaussian. A granule cell becomes active only if all connected mossy fibers are active. The input vector s is encoded by the set of active granule cells A* ⊆ M*1 × · · · × M*n. The mapping M → A maps the matrix M to the vector a of activations av of all granule cells v ∈ A.

A.1.3 Mapping A → R. Each granule cell v ∈ A is provided with a weight wv ∈ R. The weights wv are arranged in the weight vector w. With respect to the training indicator αv ∈ {0, 1} for wv, the output p is computed using only trained weights:

$$
p = \frac{1}{\sum_{v \in A^*} \alpha_v a_v} \sum_{v \in A^*} \alpha_v a_v w_v.
\tag{A.3}
$$
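In code, the output mapping of equation A.3 is a normalized weighted sum over the active granule cells. The sketch below assumes activations, weights, and training indicators are stored in dictionaries keyed by granule cell; the fallback when no trained weight is active is our assumption, since the text does not specify that case.

```python
def cmac_output(active_cells, a, alpha, w):
    """Equation A.3: output p as the weighted average over the active
    granule cells A*, counting only weights marked as trained (alpha = 1)."""
    norm = sum(alpha[v] * a[v] for v in active_cells)
    if norm == 0.0:
        return 0.0  # no trained weight is active (assumed fallback)
    return sum(alpha[v] * a[v] * w[v] for v in active_cells) / norm
```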
A.2 Learning Algorithm. Given the target output p* and a learning rate η (0 < η ≤ 1), the following calculations are performed for each weight wv with v ∈ A*:

$$
\Delta w_v = \frac{a_v}{a_{\mathrm{square}}}
\left( \sum_{u \in A^*} \alpha_u a_u \, p^* - \sum_{u \in A^*} \alpha_u a_u w_u \right),
\qquad
a_{\mathrm{square}} = \sum_{v \in A^*} \alpha_v a_v^2,
$$
$$
w_v^{(\mathrm{new})} =
\begin{cases}
p^* + \eta \, \Delta w_v & \text{if } w_v \text{ is not trained} \\
w_v + \eta \, \Delta w_v & \text{if } w_v \text{ is trained.}
\end{cases}
\tag{A.4}
$$

First, untrained weights are set to the target p*. Then the remaining error is evenly distributed over all weights, so that changes in already trained weights are kept as small as possible to obtain a smooth interpolation (Tolle and Ersü 1992).
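A minimal sketch of one training step, reading equation A.4 literally: the normalizing sums are computed with the indicators and weights as they stand before the update, untrained weights are initialized at the target, and all active weights then receive the evenly distributed remaining error.

```python
def cmac_train_step(active_cells, a, alpha, w, p_star, eta):
    """One application of equation A.4 for target p* and learning rate eta."""
    a_square = sum(alpha[v] * a[v] ** 2 for v in active_cells)
    weighted_target = p_star * sum(alpha[u] * a[u] for u in active_cells)
    weighted_output = sum(alpha[u] * a[u] * w[u] for u in active_cells)
    for v in active_cells:
        if a_square == 0.0:
            delta = 0.0  # nothing trained yet; assumed convention
        else:
            delta = (a[v] / a_square) * (weighted_target - weighted_output)
        if alpha[v] == 0:          # untrained weight: start from the target
            w[v] = p_star + eta * delta
            alpha[v] = 1           # mark the weight as trained
        else:                      # trained weight: distribute remaining error
            w[v] = w[v] + eta * delta
```

With η = 1 and all active weights already trained, one such step reproduces the target exactly at the training point, consistent with the smooth-interpolation goal described above.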
Figure 6: Topologically distributed encoding. x-axis: numerical value s ∈ S; y-axis: receptive fields (gaussians) and activations for a specific value s1 (squares).
Appendix B: Topologically Distributed Encoding

In topologically distributed encoding (Geiger 1990; Eldracher and Waschulzik 1993), every dimension of the input space S ∈ [0, smax[ is encoded by a number of encoding units ordered along a one-dimensional axis. To encode a value s1, the units are activated according to their overlapping, gaussian-shaped receptive fields. Since several units are always active, numerical information is encoded in a topologically distributed fashion (see Fig. 6). Each receptive field is specified by its center µ and standard deviation σ. For the highest possible accuracy of the encoding, we chose

$$
\sigma = \frac{2^{k}\,\Delta\mu}{\sqrt{2}}, \qquad k = 2, 3, \ldots,
$$

where Δµ denotes the difference between two adjacent centers of receptive fields (i.e., mossy fibers). As in topologically distributed encoding all Δµ are equal, one can write ε := Δµ to close the gap to CMAC in Section A.1.1. The resolution depends on the change in the encoding with respect to the change in the input. Thus more encoding units (with steeper sides of the gaussians, having closer centers) help to increase the achievable resolution.

References

Albus, J. S. 1981. Brains, Behaviour, and Robotics. Byte Books.

An, P. C. E., Miller III, W. T., and Parks, P. C. 1991. Design improvements in associative memories for cerebellar model articulation controllers (CMAC). In Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN-91), T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds. Elsevier, Amsterdam.

Barto, A. G., Sutton, R. S., and Anderson, C. W. 1983. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13, 834-846.
Eldracher, M., and Geiger, H. 1994. Adaptive topologically distributed encoding. In Proceedings of the International Conference on Artificial Neural Networks (ICANN-94), M. Marinaro and P. G. Morasso, eds., 771-774.

Eldracher, M., Staller, A., and Pompl, R. 1994. Function approximation with continuous valued activation functions in CMAC. Forschungsberichte Künstliche Intelligenz FKI-199-94, Institut für Informatik, Technische Universität München.

Eldracher, M., and Waschulzik, T. 1993. A topologically distributed encoding to facilitate learning. Journal of Systems Engineering, 110-119.

Fritzke, B. 1993. Growing cell structures—a self-organizing network for unsupervised and supervised learning. Tech. Rep. TR-93-026, International Computer Science Institute, Berkeley, CA.

Geiger, H. 1990. Storing and processing information in connectionist systems. In Advanced Neural Computers, R. Eckmiller, ed., pp. 271-277. Elsevier Science Publishers B.V., Amsterdam.

Hormel, M. 1989. A self-organizing associative memory system for control applications. In Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 332-339. Morgan Kaufmann, San Mateo, CA.

Kinder, M., and Brauer, W. 1993. Classification of trajectories—extracting invariants with a neural network. Neural Networks 6(7), 1011-1017.

Lin, C., and Kim, H. 1991. Use of CMAC neural networks in reinforcement self-learning control. In Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks (ICANN-91), T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, eds., pp. 1285-1288. Elsevier, Amsterdam.

Mischo, W. S. 1992. Receptive fields for CMAC: An efficient approach. In Artificial Neural Networks II: Proceedings of the International Conference on Artificial Neural Networks (ICANN-92), I. Aleksander and J. G. Taylor, eds. Elsevier, Amsterdam.

Nie, J., and Linkens, D. A. 1993. A fuzzyfied CMAC self-learning controller. In Proceedings of FUZZ-IEEE '93, 500-505.

Ritter, H., Martinetz, T., and Schulten, K. 1991. Neuronale Netze: Eine Einführung in die Neuroinformatik selbstorganisierender Netzwerke. Addison-Wesley, Reading, MA.

Tolle, H., and Ersü, E. 1992. Neurocontrol: Learning Control Systems Inspired by Neuronal Architectures and Human Problem Solving Strategies. Vol. 172 of Lecture Notes in Control and Information Sciences. Springer-Verlag, Berlin.
Received January 9, 1996; accepted May 20, 1996.
Communicated by Misha Mahowald
An Analog Memory Circuit for Spiking Silicon Neurons

John G. Elias, David P. M. Northmore, Wayne Westerman
Departments of Electrical Engineering and Psychology, University of Delaware, Newark, DE 19716 USA
A simple circuit is described that functions as an analog memory whose state and dynamics are directly controlled by pulsatile inputs. The circuit has been incorporated into a silicon neuron with a spatially extensive dendritic tree as a means of controlling the spike firing threshold of an integrate-and-fire soma. Spiking activity generated by the neuron itself and by other units in a network can thereby regulate the neuron's excitability over time periods ranging from milliseconds to many minutes. Experimental results are presented showing applications to temporal edge sharpening, bistable behavior, and a network that learns in the manner of classical conditioning.

1 Introduction

Neuromorphic engineers and other researchers interested in fabricating artificial neurons have long sought a simple and area-efficient on-chip analog memory device. Various types of analog memory devices have been used to store synaptic weights (e.g., Shoemaker et al. 1988; Holler et al. 1989; Lee et al. 1991; Murray and Tarassenko 1994; Hasler et al. 1995), provide offset correction in silicon retinas (Mead 1989), represent calcium concentration in silicon neurons (Mahowald and Douglas 1991), hold state in artificial dendrites (Elias and Northmore 1995), and set conductance parameters for voltage-sensitive channels (Douglas and Mahowald 1995). Other applications of analog memory include setting the threshold and integration time constant of an integrate-and-fire soma and setting the dynamics and conductance parameters of silicon dendrites. Many of these applications may be best served by memory devices that respond rapidly to network activity and thus need to retain state for only short periods of time. As neuromorphic engineers draw closer to realizing accurate hardware models of neurons, the need for a simple analog memory device grows correspondingly.

Before discussing the state of the art in integrated analog memory devices, we provide some points of reference to which existing designs can be compared. We therefore define an ideal analog memory device as one that has the following attributes: fabricated with standard integrated circuit

Neural Computation 9, 419-440 (1997)
© 1997 Massachusetts Institute of Technology
(IC) processing methods, nonvolatile storage, high precision, wide dynamic range, rapid and easy programmability, small footprint, and operation from a single power supply. There may be other desirable characteristics, but these few represent the bulk of what is needed and what is typically difficult to achieve.

The capacitor is the basis for most IC analog memory, as well as for several forms of what is generally regarded as digital memory. An ideal capacitor, once charged, retains its state indefinitely. It is free of voltage, temperature, and frequency effects, and its state is easily changed over a wide range. Of all integrated devices, the simple capacitor comes closest to achieving ideal behavior. Various types of integrated planar capacitors are easily fabricated using the metal-oxide-semiconductor (MOS) standard fabrication process available through the MOS integration service (MOSIS). Foremost among these are the poly-oxide-poly and poly-oxide-substrate forms, with the former having more ideal characteristics. Because of the extreme purity of the oxide layer used to form the dielectric in these types of capacitors, they can hold their state for long periods of time. For poly-oxide-substrate capacitors, the charge leakage is so low that significant amounts of information can be retained for decades (Advanced Micro Devices 1995).

When integrated capacitors are connected to other IC elements, such as the drain or source terminals of a MOS field effect transistor (MOSFET), their charge retention times are greatly reduced because of unavoidable charge leakage pathways in the additional circuit elements. With MOSFETs, the charge leakage is primarily through reverse-biased PN junctions. Although small in magnitude, the leakage can still be large enough to drain a small capacitor of its charge in a matter of seconds. Capacitor-based digital memory such as dynamic random access memory (DRAM) therefore needs to be refreshed (recharged) periodically to maintain state. Analog memory may have to be refreshed as well, and since a precise voltage level might be required, the refresh process may present implementation difficulties.

1.1 CAPS-and-DAC. Of the three analog memories discussed in this paper, the simplest to understand is an array of capacitors recharged by a digital-to-analog converter (CAPS-and-DAC), whose basic architecture is shown in Figure 1a. The voltage on any one capacitor is precisely set by a digital-to-analog converter (DAC) and de-multiplexer. Address lines select the memory location to be set, and data lines to the DAC encode the desired voltage. Since there are unavoidable leakage paths, the voltage of each memory cell must be refreshed periodically using digitally encoded states, typically stored off-chip in conventional digital memory (SRAM or DRAM). The de-multiplexer circuitry is normally integrated along with the capacitor array, but the DAC could be located off-chip to save space and reduce pin count.
Figure 1: (a) Basic architecture for CAPS-and-DAC analog memory. An n-bit DAC is used to refresh capacitor voltages through a switching network. A k-bit address applied to the de-multiplexer connects a particular capacitor in the array to the DAC output, which sets its state to the desired value. (b) Plot of the minimum refresh frequency needed to maintain a desired level of precision with CAPS-and-DAC analog memory when the decay time constant is 10 sec. (c) Basic architecture for an array of floating gates. A k-bit address selects a particular floating gate for electron removal or electron addition. The direction and rate of charge flow during programming depend on the sign and magnitude of the high voltage (HV) inputs.
The rate of voltage decay due to charge leakage in CAPS-and-DAC memory depends on temperature, leakage pathway characteristics, and the capacity of each capacitor. The decay rate measured as a percentage of the initial state is independent of the initial voltage. Therefore, maintaining state to a certain precision requires refreshing frequently enough to keep the associated voltage ripple, expressed as a binary fraction of the full-scale swing, below 0.5 of a least significant bit (LSB). If we assume the decay is a single exponential, then the refresh frequency needed to maintain state with a ripple of less than 0.5 LSB is given by

$$
f = \frac{-1}{\tau \cdot \log\left[1 - \dfrac{V_{fs}}{2\,V_0 \cdot (2^{N} - 1)}\right]},
\tag{1.1}
$$

where τ is the time constant of the voltage decay, Vfs is the full-scale voltage, V0 is the desired state, and N is the number of bits of precision. The worst case (the highest required refresh frequency) occurs when the desired state is equal to Vfs. Figure 1b is a plot of the worst-case refresh frequency needed for various values of N when the time constant for decay is 10 seconds. With this time constant, a 200 Hz refresh rate is needed to maintain a precision of 10 bits, while only a 12 Hz rate is needed to maintain 6 bits of precision.
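As a quick sanity check on equation 1.1, the following sketch reproduces the worst-case numbers quoted above for a 10 sec decay time constant. The absolute value of Vfs is immaterial here, since only the ratio Vfs/V0 enters.

```python
import math

def refresh_frequency(tau, v_fs, v0, n_bits):
    """Equation 1.1: minimum refresh rate that keeps the voltage ripple
    below 0.5 LSB for a single-exponential decay with time constant tau."""
    return -1.0 / (tau * math.log(1.0 - v_fs / (2.0 * v0 * (2 ** n_bits - 1))))

# Worst case (V0 = Vfs) with tau = 10 sec, as in Figure 1b:
for n in (6, 10):
    print(n, "bits:", refresh_frequency(10.0, 1.0, 1.0, n), "Hz")
# about 12.5 Hz for 6 bits and about 204.6 Hz for 10 bits,
# matching the ~12 Hz and ~200 Hz figures quoted above.
```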
In addition to the voltage ripple caused by the continuous decay and recharge of the capacitor, there are other noise effects having to do with connecting the target capacitor to the voltage source during setting and refreshing operations. These effects, generally called clock feedthrough (e.g., Sheu and Hu 1983; Wilson et al. 1985), can be significant depending on the design of the switching circuitry, the input voltage, and the size of the memory capacitance. The noise voltage generated is due to two effects: (1) capacitive coupling between the gate and drain electrodes, which allows a portion of the gate signal to couple into the holding capacitor, and (2) mobile charge left in the channel between the drain and source electrodes when the transistor is turned off, a portion of which flows to the holding capacitor, thus changing its state.

1.2 Floating Gates. This type of analog memory (Frohman-Bentchkowsky 1971) has several desirable characteristics: its charge retention time is extremely long, a memory cell requires only a single minimum-size transistor, and it is possible to set its state precisely. In these devices, there are essentially no pathways through which charge can flow, because the capacitor serving as the memory is completely isolated from all other circuits. The capacitor is also the gate of a MOSFET and therefore directly affects the transistor's operation. The state of the capacitor can be set using Fowler-Nordheim tunneling (Lenzlinger and Snow 1969) or, in some implementations, using both tunneling and hot electron injection (Diorio et al. 1995). Both methods require elevated voltages and a relatively long time to set the desired state compared to CAPS-and-DAC memory. In general, the
charging rate of the capacitor (floating gate) is not well known. Therefore, the accuracy and precision to which the memory can be set are limited unless a feedback technique is used (Diorio et al. 1995). Floating gate memory devices were proposed for use in artificial neural networks by Alspector (1987) and were integrated as a large array of over 10,000 memory elements by Holler et al. (1989). Figure 1c shows the basic architecture of a floating gate memory array in which each memory element needs to be accessed only once to set its state, which can then be stable for years.

Despite the many positive attributes of CAPS-and-DAC and floating gate analog memories, neither is very suitable as a network-dependent short-term memory element in artificial nervous systems. (By short term, we mean time scales of tens of microseconds to hundreds of seconds.) To make CAPS-and-DAC memory devices network-dependent would require constant monitoring of network activity and computing a new state for each memory cell. This could potentially be done using a dedicated processor, but it would add considerably to system complexity. As a long-term adaptive memory element, devices based on floating gates are unequaled (Hasler et al. 1995), but they require high voltages for programming, and their dynamics depend significantly on these voltage levels. (By dynamics, we mean the rate of change of the memory state.) In practice, it may be difficult to change floating gate memory dynamics on an individual basis, since each would require a different high-voltage level. In our work, we sought a memory device that could be fabricated in arrays, that required no high-voltage supplies, and whose dynamics and state were easily set by pulsatile network activity. To this end, we developed a simple four-transistor analog memory circuit, which we call a flux capacitor.

2 Flux Capacitor

The flux capacitor is based on techniques used with CAPS-and-DAC analog memory and switched capacitors (Allen and Sanchez-Sinencio 1984). It uses a conventional MOS capacitor to hold charge, but its state is set by two opposing local synapses: an upper synapse that increases the capacitor voltage and a lower synapse that decreases it. A diagram of the circuit is shown in Figure 2a. On every upper (Su) synaptic activation, the holding capacitor, Ch, receives a small packet of charge from the upper capacitor, Cu. And with every lower (SL) synaptic activation, Ch gives up a similarly small packet of charge to the lower capacitor, CL. Therefore, setting the flux capacitor's state to any desired level requires a sequence of upper or lower synaptic activations. The effect that each activation has on the state of the flux capacitor is determined by the synaptic weight, which is set by the size of the synapse capacitors, Cu or CL, relative to the holding capacitor, Ch.

Each synapse (see Fig. 2b) has three components: two MOS transistors connected in series, drain to source, and a small capacitor that connects between their common node and ground. Lower synapses use n-channel
Figure 2: (a) Basic circuit for flux capacitor analog memory. Su and SL are the upper and lower synapses, respectively; Vu and VL are the corresponding synaptic supply potentials; state is held on Ch; Cu and CL relative to Ch set the effective synaptic weights. Each synapse is made up of two MOS transistors in series. (b) Synapse circuit schematic and VLSI layout. Only two MOS transistors make up each synapse. The physical layout is minimum size and makes use of the parasitic diffusion capacitance between the transistors to realize Cu or CL. (c) Measured flux capacitor output voltage as a function of the desired voltage computed from equation 2.4.
transistors; upper synapses use p-channel transistors. The two transistors of a particular synapse never turn on at the same time. Synaptic activation proceeds by first briefly (∼50 nsec) turning on the transistor closest to the supply voltage source (Vu or VL) to charge the synapse capacitor to that potential. This is followed by briefly turning on the transistor nearest the holding capacitor, which either adds or removes a certain amount of charge from Ch. The state and dynamics of the memory are set by the average charge flux entering and leaving the circuit through its synapses. The synapses
are directly activated by afferent pulsatile signals coming from multiple sources (neuromorphs) anywhere in a network. A flux capacitor analog memory can represent transient or sustained information concerning the system and, with the appropriate connections, can respond to a wide range of network activities. Several examples of its utility as a dynamic memory for the spike firing threshold will be presented later.

2.1 Statics. The steady-state (average) flux capacitor voltage is given by

$$
V = \frac{f_u \cdot W_u \cdot V_u + f_L \cdot W_L \cdot V_L + g \cdot V_g}{f_u \cdot W_u + f_L \cdot W_L + g},
\tag{2.1}
$$
where fu and fL are, respectively, the average frequencies at which the upper and lower synapses are activated, Wu and WL are the respective synaptic weights, and g accounts for the state decay due to charge leaking away to the voltage source Vg. Here we assume that the charge leakage between synaptic activations can be represented by a single exponential with time constant RCh, where R represents the resistance of the various unwanted charge-leakage pathways. The average period between synaptic activations is T. Equation 2.1 can also be solved for the ratio of upper to lower rates of synaptic activation given the desired voltage V (the leading minus sign makes the ratio positive, since V − Vu is negative for any reachable state):

$$
\frac{f_u}{f_L} = -\,\frac{W_L \cdot (V - V_L) + g \cdot T \cdot (V - V_g)}{W_u \cdot (V - V_u) + g \cdot T \cdot (V - V_g)}.
\tag{2.2}
$$
Definitions of the various parameters are:

$$
W_u = \frac{C_u}{C_u + C_h} \ \ \text{(upper weight)}, \qquad
W_L = \frac{C_L}{C_L + C_h} \ \ \text{(lower weight)},
$$
$$
g = \frac{1}{T}\left(1 - e^{-T/(R C_h)}\right) \ \ \text{(decay factor)}, \qquad
T = \frac{1}{f_L + f_u} \ \ \text{(average synapse activation period)}.
\tag{2.3}
$$
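The statics are easy to reproduce numerically. Below is a short sketch of equations 2.1 and 2.3, with parameter values borrowed from the Figure 3a caption as an assumed example.

```python
import math

def steady_state_voltage(fu, fl, wu, wl, vu, vl, r_ch, vg=0.0):
    """Equation 2.1: average flux capacitor voltage for upper/lower
    activation rates fu, fl and an RCh leakage time constant r_ch."""
    t = 1.0 / (fu + fl)                   # average period (equation 2.3)
    g = (1.0 - math.exp(-t / r_ch)) / t   # decay factor (equation 2.3)
    return (fu * wu * vu + fl * wl * vl + g * vg) / (fu * wu + fl * wl + g)

# Wu = WL = 0.01, RCh = 700 sec, Vu = 5, VL = Vg = 0, fu = fL = 10 Hz:
print(steady_state_voltage(10.0, 10.0, 0.01, 0.01, 5.0, 0.0, 700.0))
# about 2.48 V, close to the simplified equation 2.4: Vu / (1 + fL/fu) = 2.5 V
```

Equation 2.2 simply inverts this relation to find the activation ratio that produces a desired voltage.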
Experimentally, RCh time constants have been found to be fairly long (∼700 sec with Ch = 0.5 pF) compared to synaptic activation frequencies, which in our system typically range between 0.1 and 500 activations per second. Equation 2.1 simplifies considerably under easily arranged conditions: the synaptic weights are equal (CL = Cu), the lower supply voltage VL and leakage voltage Vg are zero, and the decay time constant RCh is much larger than the period between synaptic activations, T:

$$
V = \frac{V_u}{1 + f_L / f_u}.
\tag{2.4}
$$
Figure 2c shows the static output of a flux capacitor as a function of the desired voltage computed from equation 2.4. The desired voltage determined
the activation frequency ratio of lower to upper synapses, which was then applied to the flux capacitor. To enable measurement of the flux capacitor state, we used an on-chip source follower to buffer its output. Therefore, all reported measurements are approximately 1 volt lower than the actual flux capacitor voltage. In addition, memory voltages less than 1 volt could not be measured because of the limited lower range of the source-follower buffer.

2.2 Noise. The voltage on the holding capacitor, Ch, is always in a state of flux due to discrete synaptic events. Therefore the steady-state capacitor voltage has a corresponding noise component, whose magnitude determines the precision to which the state of the flux capacitor can be set. The change in flux capacitor state due to each synaptic activation is given by

$$
\Delta V_u = (W_u + SN_u) \cdot (V_u - V_h), \qquad
\Delta V_L = (W_L + SN_L) \cdot (V_L - V_h),
\tag{2.5}
$$
where Vh is the state before the synapse is activated, ΔVu is the change in voltage due to activating an upper synapse, ΔVL is the change due to activating a lower synapse, and SNx is a constant representing the charge injection due to switching (i.e., the clock feedthrough effect). The voltage step depends on the current state, the supply potential, the synapse weight, and the charge injection factor. Henceforth, references to flux capacitor synaptic weights include the contribution due to charge injection. As with real synapses (e.g., Shepherd and Koch 1990), the change in membrane state due to synaptic activation depends on the instantaneous voltage of the local membrane and on the synaptic conductance (weight). If the synaptic weights are equal, the maximum peak-to-peak noise due to individual synaptic activations is

$$
V_{pp} = \Delta V_u - \Delta V_L = W \cdot (V_u - V_L),
\tag{2.6}
$$
and the precision to which a particular state can be set, in bits, is given by

$$
N = -\log_2 \frac{V_{pp}}{V_u - V_L} = \log_2 \frac{1}{W}.
\tag{2.7}
$$
Therefore, for 8-bit precision, the holding capacitor must be at least 256 times larger than either synapse capacitor. Of course, the relative sizes of the capacitors will depend on the application. In our work, we have fabricated and tested flux capacitors that have Ch/Cu ratios between 100 and 2250. In all of our flux capacitor memories, the lower and upper synapse transistors are minimum size, and in most cases the synapse capacitances are equal
in value and derived solely from parasitic sources such as drain diffusions and short segments of wiring, which amounts to an effective capacitance on the order of 1-5 fF. The footprint required for the holding capacitor is not very large for low-precision applications. For example, a 6-bit flux capacitor requires a Ch footprint of about 25 × 25 square microns for a standard IC process available through MOSIS. The data shown in Figures 2, 3, and 4 were taken from a flux capacitor that had a Ch of approximately 0.5 pF, so its maximum effective precision was nearly 7 bits.

Since there is an inevitable change in state due to charge leakage, the precision to which a particular state can be set also depends on the average synaptic activation period, T. In general, the activation rate must be significantly faster than the leakage time constant to minimize the reduction in bit precision, if bit precision is important for the application. Figure 3a shows experimental data plotting flux capacitor output voltage as a function of the average activation rate when both upper and lower synapses were activated at the same rate. The experimental data are plotted along with the expected output computed using equation 2.1 with parameters set to their measured or estimated values. A constant voltage of 0.9 was added to the experimental data to compensate for the source-follower transfer function. As can be seen, the output from this flux capacitor is fairly constant for frequencies above approximately 10 Hz.

An upper bound on the reduction in bit precision can be determined by equating the peak-to-peak noise voltage (cf. equation 2.6) to an expression describing the decay in voltage due to charge leakage alone, as in equation 2.8 (left side). Thus, when the synaptic weights are equal (W = Wu = WL), the degradation in precision from the theoretical value given by equation 2.7 can be limited to 1 bit by ensuring that the average activation period, T, is no greater than the bound in equation 2.8 (right side):

$$
W \cdot (V_u - V_L) = (V_h - V_L) \cdot \left(1 - e^{-T/(R C_h)}\right),
\qquad
T \le -R \cdot C_h \cdot \log\left(1 - \frac{(V_u - V_L) \cdot W}{V_h - V_L}\right).
\tag{2.8}
$$
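A brief sketch of the precision bookkeeping in equations 2.7 and 2.8 follows; the mid-range state Vh = 2.5 V and the 700 sec leakage time constant are assumed example values consistent with the devices described above.

```python
import math

def precision_bits(w):
    """Equation 2.7: attainable precision, in bits, when both synaptic
    weights equal w (one synaptic step is the peak-to-peak ripple)."""
    return math.log2(1.0 / w)

def max_activation_period(r_ch, w, vu, vl, vh):
    """Equation 2.8 (right side): longest average activation period T that
    limits the leakage-induced degradation in precision to about 1 bit."""
    return -r_ch * math.log(1.0 - w * (vu - vl) / (vh - vl))

print(precision_bits(1.0 / 256))  # 8.0 bits when Ch is 256x the synapse capacitor
print(max_activation_period(700.0, 1.0 / 128, 5.0, 0.0, 2.5))  # about 11 sec
```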
2.3 Dynamics. Although the flux capacitor could be used as a type of static analog memory device whose bit precision is determined by the decay and synaptic activation rates, its principal intended use is as a network-dependent dynamic memory. In nervous systems, there may exist many examples of adaptive, network-dependent effects, such as membrane potentiation, modification of dendrite cable properties, or even structural changes, that may serve a short-term memory function. Such state changes could play an important role in modifying and regulating behavior. In artificial nervous systems, network-dependent short-term memories may be essential for similar purposes. Although floating gate devices have been proposed as the basis for an adaptive memory (e.g., Hasler et al. 1995), we believe that
Figure 3: (a) Flux capacitor output as a function of average activation frequency (1/T) when both upper and lower synapses are activated at the same rate. Theoretical curve computed from equation 2.1 using the following parameters: WL = Wu = 0.01, RCh = 700 sec, Vu = 5.0, fu = fL, VL = 0, and Vg = 0. (b) Measured flux capacitor voltage decay from a set state after synaptic activation ceases. Two different devices are shown: flux capacitor 9 (FC9) had a 25 × 40 square micron MOS holding capacitor; flux capacitor 3 (FC3) had a 115 × 115 square micron poly-poly capacitor. Calculated time constants are 700 sec for FC9 and 2700 sec for FC3.
the flux capacitor offers more advantages for use in short-term, network-dependent memory applications. In particular, the flux capacitor requires no elevated voltages, its memory state and dynamics are determined by direct synaptic activation, and its state can be set precisely without additional feedback or monitoring circuitry.

Equation 2.5 shows that each synaptic activation produces either an upward or a downward movement of the flux capacitor voltage. The dynamics are therefore entirely determined by the synaptic activation rate: the more slowly the flux capacitor synapses are activated, the more slowly its state changes (this assumes that the leakage time constant is long compared to the average period between synaptic activations). The fastest rate at which synapses can be activated depends on the connection system employed (Mahowald 1991; Elias 1993). With our existing virtual wire connection system, flux capacitor synapses could be activated as fast as 10^7 activations per second.

The state trajectory that the flux capacitor follows depends on the ordering of synaptic activations. The final state, however, is independent of the order of activating a fixed number of upper and a fixed number of lower synapses, provided that the supply potentials (Vu or VL) were not encountered at some point along the state trajectory. Figure 4 shows three examples of the flux capacitor state following a prescribed path. Each trajectory was generated by a particular temporal pattern of upper and lower synaptic activations applied to a flux capacitor through our virtual wire network connection circuit. The trajectories can easily be made to fit a wide range of different time scales. As an example, the state trajectory shown in Figure 4b was generated with equal fidelity over several time scales ranging from 800 µsec to 80 sec. The limits to how far the time scale can be stretched or compressed depend, respectively, on the leakage decay rate and on the maximum rate at which synapses can be activated.

Since the flux capacitor state can be rapidly changed and set by synaptic activations, the question arises of how to embed the circuit in a network in order to produce desirable (i.e., appropriately adapting) behavior. In the next section, we present several experiments that show interesting adaptation responses to specific network activity when the flux capacitor state is used as the spike firing threshold for silicon neurons.

3 Network-Dependent Spike Threshold

3.1 Hardware System. Network-dependency experiments were conducted using our silicon neuron, or neuromorph (Elias and Northmore 1995), with an integrated flux capacitor as a means of setting the spike firing threshold of its integrate-and-fire "soma" (see Fig. 5). The dendritic branches of the neuromorph are composed of a series of compartments, each with a capacitor representing a membrane capacitance and two programmable resistors representing the axial cytoplasmic and membrane resistances. Synapses at each compartment (small vertical lines in Fig. 5) are
Figure 4: State trajectories measured from a flux capacitor using an on-chip source follower. The temporal patterns of synaptic activation that produced the trajectories were delivered to flux capacitor synapses by a network connection circuit called virtual wires (Elias 1993). (a) Sine. (b) Damped sine. (c) Sum of three sine waves. The same or arbitrary trajectories can be generated over widely different time intervals. For example, data shown in (b) were collected using several different sampling periods spanning 800 µsec to 80 sec.
emulated by a pair of MOSFETs, one of which, the excitatory synapse, enables inward current, moving the potential of the membrane capacitance in a depolarizing direction. The other transistor, emulating an inhibitory synapse, enables outward current, hyperpolarizing the membrane capacitance. There are also shunting or silent inhibitory synapses that pull the membrane potential toward a potential close to the resting value. The potential appearing at the soma (Vs ; see Fig. 5) determines the output spike firing rate in conjunction with the integration time constant, RC, and the threshold, Vth . The spike firing threshold is derived directly from a flux capacitor whose state is set by network spiking: activating the upper threshold
Figure 5: A silicon neuron with an artificial dendritic tree (ADT) and integrateand-fire “soma.” Vs is “membrane voltage” at the soma; R and C determine the integration time constant; R is programmable. The spike firing threshold, Vth , is the state of a flux capacitor, which is set by network activity: activating upper threshold synapse (filled circle) raises spike firing threshold voltage; activating lower synapse reduces it.
synapse (filled circle) raises the spike firing threshold voltage; activating the lower threshold synapse (open circle) reduces it. All synapses (on the dendrites and on the threshold-setting flux capacitor) are activated by a 50 nsec impulse signal.

Experiments were performed using multiple neuromorphs having four- and eight-branched dendritic trees. The dendrite dynamics were set to give membrane time constants in the range of 10 msec. The neuromorphs were embedded in the "virtual wire system" previously described (Elias 1993). This system allows spikes generated by thousands of neuromorphs to be distributed with programmable delays to arbitrary synapses throughout a network. A host computer generated the spatiotemporal patterns of spikes used as test stimuli for the network. It also recorded the soma membrane potential, Vs, and its spiking threshold, Vth, and read the "spike history," which is part of the virtual wire system, to obtain the times of occurrence of the output spikes generated by all of the neuromorphs.

3.2 Sustained and Transient Responses. Figure 6a depicts connections to a neuromorph showing how its excitability can be controlled by network activity. In the absence of other inputs to the threshold-setting synapses, a fixed threshold for the integrate-and-fire soma is achieved by delivering tonic spike trains with frequencies fu and fL to the upper and lower threshold-setting synapses of the soma (see equation 2.1). Using the virtual wire system, these spike trains could be derived from the activity of any neuromorph in the network, but for experimental purposes they are conveniently generated by pacemaker neuromorphs whose uniform spiking rates can be easily programmed. Because of the long time constant of the flux capacitor (∼700 sec), the pacemaker frequencies need to be only a few spikes per second.
Figure 6: (a) Two-branch dendritic tree neuromorph. A spike train of 1 sec duration is input to a proximal excitatory synapse on one branch. Tonic spike trains of frequency fu and fL are delivered to the upper and lower threshold-setting synapses, along with various amounts of feedback to the upper synapse from the output. (b) Soma voltage (Vsoma) and threshold voltage (Vth) as a function of time, together with output spikes. At 0.5 sec (arrow), a 100 spikes per second input train was applied to the excitatory synapse for 1 sec while fu = 0 and fL = 10. When the output stops firing (∼1.5 sec), the threshold slowly returns to the previous level due to tonic activation of the lower threshold synapse. (c) Spike output frequency for four different combinations of fu, fL, and feedback. Curve 1: fu = 12.5, fL = 20, no feedback; curve 2: fu = 42, fL = 111, feedback = one upper activation per output spike; curve 3: fu = 0, fL = 12.5, feedback = one upper activation per output spike; curve 4: fu = 0, fL = 12.5, feedback = two upper activations per output spike.
An example of self-regulation of neuromorph excitability occurs when tonic spike trains are supplied to the threshold-setting synapses and the neuromorph's own output spikes are fed back with zero delay to its threshold-setting synapses. The tonic spike input sets the spike firing threshold according to equation 2.1, which, if below the "membrane-resting potential,"
produces output spiking in the absence of any dendritic input. The feedback connections adjust the spike firing threshold (cf. equation 2.5) every time the neuromorph fires. In the absence of any synaptic input to the dendritic tree, the neuromorph achieves an average firing rate of f0 spikes per second given by

$$
f_0 = \frac{-1}{RC \cdot \log\left(1 - \dfrac{V_u}{V_{soma} \cdot (1 + f_L/f_u)}\right)},
\tag{3.1}
$$

where Vsoma is the "membrane potential," RC is the integrate-and-fire soma time constant (see Fig. 5), and fL and fu are the activation frequencies of the lower and upper threshold synapses. The threshold synapse activation frequencies are derived from the feedback of the neuromorph's own output and from the outputs of any other neuromorphs in the network: fu = a·f0 + Σfupper and fL = b·f0 + Σflower, where Σfupper and Σflower represent the summed activation frequencies due to all other neuromorphs whose outputs connect to the threshold-setting synapses, and a and b are the numbers of feedback connections to the upper and lower threshold synapses, respectively.

Figure 6b shows the response evoked by a 100 Hz input spike train of 1 sec duration delivered to an excitatory synapse on the dendritic tree. The soma potential, Vsoma, and the threshold potential, Vth, are plotted together with a sample train of output spikes. A tonic spike train of 12.5 spikes per second was applied to the lower threshold-setting synapse, and the neuromorph's output was fed back to its upper threshold synapse, which resulted in an average output firing rate of around 10 spikes per second. During the 1 sec input train (beginning at the arrow), Vsoma is raised 240 mV, evoking a transient burst of output spikes followed by sustained firing at a slower rate. The output response accommodates because each fed-back output spike raises the soma threshold according to equation 2.5. At the end of the input train, and for more than a second after, the threshold exceeds Vsoma, and spike firing ceases. With each activation of the lower threshold synapse, Vth drops according to equation 2.5. Eventually Vth falls below Vsoma, causing the neuromorph to fire at a regular rate again. Figure 6c (curve 3) shows the instantaneous output firing frequency obtained with these connections.

The sustained response component can be emphasized relative to the transient component by increasing the rate of lower threshold synapse activation with respect to the feedback activation rate of the upper threshold synapse. With no feedback, a pure sustained response is observed (curve 1 in Fig. 6c). With a single feedback activation, an increased transient response occurs (curves 2 and 3). Increasing the number of feedback activations of the upper synapse per output spike generates sharper transient responses (curve 4). In the recordings of curves 1 and 2, an additional tonic input to the upper threshold synapse was employed to sustain approximately the same average firing rate in the absence of input as in the recording of curve 3 (see equation 3.1).
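Because the feedback terms make equation 3.1 implicit in f0, it must be solved self-consistently. The sketch below does this by fixed-point iteration; with the parameter values quoted for Figure 7b in the next section, it converges to roughly 82 spikes per second. The initial guess is our assumption and must place the effective threshold below Vsoma.

```python
import math

def f0_with_feedback(rc, vu, vsoma, a, b, f_upper, f_lower,
                     f0_init=100.0, iters=100):
    """Solve equation 3.1 with fu = a*f0 + f_upper and fL = b*f0 + f_lower
    by fixed-point iteration (requires Vu / (Vsoma*(1 + fL/fu)) < 1)."""
    f0 = f0_init
    for _ in range(iters):
        fu = a * f0 + f_upper
        fl = b * f0 + f_lower
        f0 = -1.0 / (rc * math.log(1.0 - vu / (vsoma * (1.0 + fl / fu))))
    return f0

# Parameters quoted for Figure 7b: RC = 0.005, Vu = 5, Vsoma = 1.88,
# a = 2, b = 4, tonic rates of 20 and 25 spikes per second.
print(f0_with_feedback(0.005, 5.0, 1.88, 2, 4, 20.0, 25.0))  # ~82 spikes/sec
```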
Figure 7: (a) A two-branch dendritic tree neuromorph with connections for bistable behavior. A 1-second-long input spike train of 100 spikes per second was delivered to either a strong excitatory synapse (open circle on dendrite) or to a strong inhibitory synapse (filled circle). Tonic spike trains of 20 and 25 per second activated upper and lower threshold-setting synapses, respectively. Each output spike activated the upper threshold synapse twice and the lower threshold synapse four times. (b–e) Soma voltage (Vsoma ), threshold (Vth ), and peri-stimulus time histogram of output spike firing. (b) Response to excitatory input in low threshold state. Excitatory input does not cause low-to-high threshold state transition. (c) Inhibitory input effects change from low to high threshold state. (d) Inhibitory input in high threshold state does not affect state. (e) Excitatory input changes state from high to low threshold.
3.3 Bistable Behavior. The connections to the threshold-setting synapses shown in Figure 7a cause Vth to move to one of two defined levels, allowing the neuromorph to exhibit bistable behavior. Tonic spike trains of 20 and 25 spikes per second activating the upper and lower synapses alone move Vth to ∼2.22 V (cf. equation 2.4 with Vu = 5). This threshold value prevents the neuromorph from firing spontaneously. However, when the neuromorph receives a brief burst of excitatory dendritic input, causing it to fire, the feedback activations move Vth low enough (∼1.7 V) to enable continuous firing in the absence of excitatory input. The soma firing frequency is then sufficient to hold the threshold low and to overcome the tendency of the tonic spike trains to raise the threshold. Figure 7b shows the neuromorph firing at a spontaneous rate of about 82 per second (cf. equation 3.1 with Σfupper = 20, Σflower = 25, RC = 0.005, Vu = 5, Vsoma = 1.88, a = 2, b = 4), the rate rising temporarily under the influence of a 1 sec train of spikes to an excitatory synapse on the dendrite. In Figure 7c, a strong inhibitory input
to the dendrite shuts off firing, allowing Vth to rise to its upper level under the influence of the tonic input spikes. It remains in the high-threshold state (see Fig. 7d) until a strong excitatory input on the dendrite generates sufficient output spiking to switch the neuromorph to the low-threshold state again (see Fig. 7e). The neuromorphic circuit is capable of processing synaptic inputs in either state, provided that these are not strong enough to cause a change of state. In the high-threshold state, inputs are subjected to a thresholding nonlinearity. Excitatory input trains that are brief and relatively weak in their depolarizing effects (e.g., 50 spikes per second for 100 msec activating a proximal synapse) elicit just a few output spikes without causing a state change. In the low-threshold state, there is a much more linear relationship between output and input.

3.4 Classical Conditioning. As a further illustration of the application of the flux capacitor, Figure 8a shows a neuromorphic network that exhibits the main features of classical conditioning. The unconditioned stimulus (US) and conditioned stimulus (CS) are input trains of 100 Hz, each lasting 0.5 sec. Presentation of the US alone reliably excites the Output neuromorph, leading to an unconditioned response. The CS by itself is normally unable to excite Output but may do so via the Associator neuromorph if the Associator threshold is sufficiently low. Lowering this threshold is the function of the Correlator, which fires when coincident spikes occur in the US and CS trains (Northmore and Elias 1996). Thus, during training, when CS and US are paired as shown in Figure 8b, Correlator spikes delivered to the lower synapse of the Associator's flux capacitor gradually decrement its Vth until the CS alone is able to excite Output.

A mechanism to limit threshold lowering in the Associator is required to prevent it from firing excessively and to raise its threshold when CS and US are uncorrelated (i.e., to extinguish the conditioned response). The threshold synapses of the Associator are not driven by tonic spike trains, because this would shorten the time constants of the Vth changes. The role of the Differentiator, then, is to convert trains of spikes from several sources into brief bursts that raise the Associator's threshold. This is achieved by feeding back the Differentiator's output spikes to its own threshold-setting synapses in the ratio 4:2 (upper:lower). The effect of six threshold-setting synaptic activations per output spike ensures that the Differentiator's threshold moves quickly to a higher level, which tends to shut off further firing. The same principle was used to produce the transient responses shown in Figure 6.

Figure 8b shows a typical sequence of spike trains before and during conditioning and during extinction with the CS alone. Prior to training, the output response is evoked by the US but not by the CS. As training proceeds by pairing CS and US so that they overlap in time, the Correlator fires, lowering the Associator's threshold. After a few pairings of CS and US, the Associator's response gains in vigor and comes to anticipate the US. The CS alone is then sufficient to elicit a strong output response. Then, however, continued
Figure 8: Classical conditioning in a four-neuromorph network. (a) Inputs to the network are the unconditioned stimulus (US) and the conditioned stimulus (CS), both 100 Hz trains of 0.5 sec duration. All neuromorphs have four equal dendritic branches. Open circles on the dendrites show excitatory synaptic connections; filled circles, inhibitory. Circles on the neuromorph soma are the upper and lower threshold-setting synapses. The Correlator and Output neuromorph thresholds are set by tonic spike trains to their upper and lower synapses (connections not shown). Feedback to the Differentiator threshold synapses is in the ratio 4:2, upper to lower. Spike firing activity of all four neuromorphs is shown in (b). Left column, upper: before training, US alone evokes output response; CS alone evokes no activity. Training occurs by overlapped pairing of CS and US (lower left and upper right columns). The Correlator fires during the overlap of the CS and US. The Associator gradually starts responding and, with the Output response, comes to anticipate the US. Lower right column: CS presented alone. Output fires but gradually extinguishes because the Differentiator acts to raise the Associator's threshold.
Then, however, continued firing of the Differentiator drives up the Associator's threshold, leading to extinction of the conditioned response. In the absence of input activation, the persistence of the conditioned association, once formed, depends only on the decay of flux capacitor state.

4 Discussion

Negative feedback, provided by routing output spikes to the upper threshold synapse, allows neuromorphs to adjust their own excitability, thus stabilizing the spike firing frequency of the integrate-and-fire soma. Negative feedback also minimizes the effects due to variation between individual neuromorphs. However, it is control over the flux capacitor dynamics that provides temporal processing capabilities of generally longer duration than those obtained through the dendritic tree alone (Northmore and Elias 1996). If the threshold is made to rise rapidly with output firing, the strongly transient responses shown in Figure 6 result. With the addition of interneuromorphs, feedback spike frequency can be divided down, making for much slower rates of change in excitability. Thus, on a relatively short time scale, rapidly adapting effects giving temporal edge enhancement can be obtained; with longer dynamics, habituating-type effects may be obtained. Sensory systems commonly contain neurons with different temporal signaling characteristics; some are transient in nature, responding best to stimulus onsets and offsets, while others respond in a more sustained fashion. Such differences may be evident at the level of sensory receptors (e.g., mechanoreceptors in the skin), or they may be accentuated by subsequent neural filtering. Thus, for example, high-pass filtering in the insect compound eye (McLaughlin 1981) and in the vertebrate retina (Dowling 1987) generates signals that mark only onsets or offsets of stimulation. Sensory systems generally transmit both sustained and transient information in parallel pathways (e.g., M- and P-cell systems; RA and SA in the somatosensory system), probably as a way of maximizing channel capacity. Transients convey information that commands attention and possibly rapid response, whereas sustained responses signal the continued presence of stimuli, allowing a slower, more discriminative processing. The feedback control of neuromorph excitability makes possible a range of sustained and transient signaling that could be as important for robotics as it is for biological sensory systems. A useful property of the flux capacitor is that it can be set to particular voltage levels depending on the ratios of activations of the upper and lower synapses. Moreover, the level can be set by prolonged trains or by very brief bursts of activations, the level depending almost entirely on the ratio of the number of upper to lower synapse activations (see equation 2.4). This property allows the threshold to settle at various levels depending on which spike source dominates the threshold-setting mechanism at any
given time. The resulting bistable behavior is reminiscent of the two or more modes exhibited by neurons at various sites in the central nervous system, most notably the thalamus (McCormick et al. 1992), where input spikes are processed differently according to the threshold state. In the neuromorphic circuit illustrated in Figure 7, processing is approximately linear for small signals in the low-threshold state and nonlinear in the high-threshold state. The classical conditioning network was intended as a small-scale demonstration of how flux capacitors in a network can be used for local memory storage. In this illustration, they are used either as an effective weight storage element that may be altered by training (e.g., in the Associator) or as a memory for determining the temporal characteristics of neuromorph response (as in the Differentiator). We envision that large numbers of flux capacitors, being compact, could be distributed throughout networks as short- to intermediate-term storage to control synaptic weights and other neuromorph parameters. The "virtual wire" interconnection system (Elias 1993) that made the tiny classical conditioning network possible is designed to route spikes generated by thousands of sources, including neuromorphs and external sensors, to even larger numbers of synaptic destinations, many of which will be flux capacitor storage elements.

5 Conclusions

We have presented a design for a new type of simple, four-transistor integrated analog memory circuit, the flux capacitor. Its state and dynamics are set by the average charge flux through its two synapses, which are directly activated with pulsatile signals originating from an arbitrary number of neuromorph sources. The principal use of the flux capacitor is intended to be as a network-dependent dynamic memory in artificial nervous systems. Its state may be incremented and decremented over a wide voltage range, as well as being set to particular levels, all by means of the spiking activity generated in a network. When its state is used as a threshold voltage for an integrate-and-fire neuromorph, output firing adaptation is possible, as demonstrated here. Work is underway to use arrays of flux capacitors to bring dendrite dynamics, soma integration time constants, and synaptic conductances under the control of activity in a network.

Acknowledgments

The research reported in this article was supported by National Science Foundation grants BCS-9315879 and BEF-9511674.

References

Advanced Micro Devices. 1995. Data sheet for AM29F010 Flash Digital Memory. Sunnyvale, CA.
Alspector, J. 1987. A neuromorphic VLSI learning system. Proc. 1987 Stanford Conf. Advanced Research in VLSI, 313–349.
Allen, P. E., and Sanchez-Sinencio, E. 1984. Switched Capacitor Circuits. Van Nostrand Reinhold, New York.
Diorio, C., Hasler, P., Minch, B., and Mead, C. 1995. A high-resolution non-volatile analog memory cell. IEEE Int. Symposium on Circuits and Systems, 2233–2236.
Douglas, R., and Mahowald, M. A. 1995. Personal communication.
Dowling, J. 1987. The Retina: An Approachable Part of the Brain. Harvard University Press, Cambridge, MA.
Elias, J. G. 1993. Artificial dendritic trees. Neural Comp. 5, 648–664.
Elias, J. G., and Northmore, D. P. M. 1995. Switched-capacitor neuromorphs with wide-range variable dynamics. IEEE Trans. Neural Networks 6(6), 1542–1548.
Frohman-Bentchkowsky, D. 1971. Memory behavior in a floating-gate avalanche-injection MOS (FAMOS) structure. Applied Physics Letters 18.
Hasler, P., Diorio, C., Minch, B. A., and Mead, C. 1995. Single transistor learning synapses. Advances in Neural Information Processing Systems 7, 817–824.
Holler, M., Tam, S., Castro, H., and Benson, R. 1989. An electrically trainable artificial neural network (ETANN) with 10240 "floating gate" synapses. International Joint Conference on Neural Networks, Washington, DC, II-191–196.
Lee, B. W., Sheu, B. J., and Yang, H. 1991. Analog floating-gate synapses for general-purpose VLSI neural computation. IEEE Trans. on Circuits and Systems 38(6), 654–658.
Lenzlinger, M., and Snow, E. H. 1969. Fowler-Nordheim tunneling into thermally grown SiO2. J. Appl. Phys. 40, 278–283.
Mahowald, M. A. 1991. Evolving analog VLSI neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., Chap. 15. Academic Press, San Diego, CA.
Mahowald, M. A., and Douglas, R. 1991. A silicon neuron. Nature 354, 515–518.
Mead, C. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Mead, C. 1989. Adaptive retina. In Analog VLSI Implementations of Neural Systems, C. Mead and M. Ismail, eds., 239–246. Kluwer, Norwell, MA.
McCormick, D. A., Huguenard, J., and Strowbridge, B. W. 1992. Determination of state-dependent processing in thalamus by single neuron properties and neuromodulators. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds. Academic Press, San Diego, CA.
McLaughlin, S. B. 1981. The role of sensory adaptation in the retina. J. Exp. Biol. 146, 39–62.
Murray, A., and Tarassenko, L. 1994. Analogue Neural VLSI: A Pulse Stream Approach. Chapman & Hall, London.
Northmore, D. P. M., and Elias, J. G. 1996. Spike train processing by a silicon neuromorph: The role of sublinear summation in dendrites. Neural Comp. 8, 1245–1265.
Shepherd, G. M., and Koch, C. 1990. Dendritic electrotonus and synaptic integration. In Synaptic Organization of the Brain, G. M. Shepherd, ed., 439–473. Oxford University Press, New York.
Sheu, B. J., and Hu, C. M. 1983. Modeling the switch-induced error voltage on a switched capacitor. IEEE Trans. on Circuits and Systems CAS-30(12), 911–913.
Shoemaker, P., Lagnado, I., and Shimabukuro, R. 1988. Artificial neural network implementation with floating gate MOS devices. In Hardware Implementation of Neuron Nets and Synapses, a workshop sponsored by NSF and ONR, San Diego, CA.
Wilson, W. B., Massoud, H. Z., Swanson, E. J., George, R. T., and Fair, R. B. 1985. Measurement and modeling of charge feedthrough in n-channel MOS analog switches. IEEE J. Solid-State Circuits SC-20(6), 1206–1213.
Received January 19, 1996; accepted June 4, 1996.
Communicated by Christopher Bishop and John Platt
Average-Case Learning Curves for Radial Basis Function Networks

Sean B. Holden, Department of Computer Science, University College London, Gower Street, London WC1E 6BT, England
Mahesan Niranjan, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England
The application of statistical physics to the study of the learning curves of feedforward connectionist networks has to date been concerned mostly with perceptron-like networks. Recent work has extended the theory to networks such as committee machines and parity machines, and an important direction for current and future research is the extension of this body of theory to further connectionist networks. In this article, we use this formalism to investigate the learning curves of gaussian radial basis function networks (RBFNs) having fixed basis functions. (These networks have also been called generalized linear regression models.) We address the problem of learning linear and nonlinear, realizable and unrealizable, target rules from noise-free training examples using a stochastic training algorithm. Expressions for the generalization error, defined as the expected error for a network with a given set of parameters, are derived for general gaussian RBFNs, for which all parameters, including centers and spread parameters, are adaptable. Specializing to the case of RBFNs with fixed basis functions (basis functions having parameters chosen without reference to the training examples), we then study the learning curves for these networks in the limit of high temperature.

1 Introduction

Statistical physics has made significant contributions to the study of learning curves for feedforward connectionist networks. (It has also been applied with great success to the analysis of recurrent networks, although we do not consider the latter here. An excellent reference is Amit 1989.) In this context, a learning curve can be defined as a graph of expected generalization performance against training set size; the term is defined in full below. Most of the research that has appeared to date has been concerned with networks that do not have hidden layers and are applied to the learning of similarly structured rules, for example, a perceptron learning another perceptron
from examples (see, for example, Seung et al. 1992). Given that the use of simple networks such as these in practice is quite rare, the study of more complex networks and rules constitutes one of the most important directions for continuing research in this area (Watkin et al. 1993). Most of the analysis to date for more complex networks has been confined to committee machines (see, for example, Schwarze et al. 1992 and Barkai et al. 1992) and parity machines (Watkin et al. 1993 and references therein), which are again not networks that are applied to any significant extent in practice. This article presents results on the extension of the theory to the class of radial basis function networks (RBFNs), which have been used with considerable success in practical applications (see, for example, Renals & Rohwer 1989 and Niranjan & Fallside 1988). The learning curves for these networks have already been studied in detail in a predominantly worst-case framework based on probably approximately correct (PAC) learning theory (Anthony & Biggs 1992); the results of these studies can be found in Holden and Rayner (1995), Anthony and Holden (1994), and Lee et al. (1994). In this article, we predominantly consider RBFNs having fixed basis functions (that is, basis functions having parameters chosen without reference to the training examples). Section 2 reviews RBFNs, defines the learning problems of interest, provides a brief review of the areas of statistical physics that are relevant to the article, and maps out the mathematical development of the rest of the article. Section 3 addresses the first step in the analysis of a network using this formalism, namely, to calculate the generalization error of the network for the tasks of interest. The generalization error in this context is the expected error, for randomly selected input vectors, of a network having a given set of parameters. Section 4 addresses the derivation of learning curves. It has not been possible to date to obtain completely general results in this part of the analysis, and consequently we investigate networks having fixed basis functions and consider learning curves for the high-temperature limit. Section 5 discusses the results obtained, along with some related work (Freeman & Saad 1995), and provides some suggestions for further work. Section 6 concludes the article.
2 Preliminaries

This section briefly reviews the theory of RBFNs and fully defines the learning problems of interest. It then provides a short introduction to the areas of statistical physics relevant here and explains the mathematical development of the rest of the article. We do not provide a full account of the statistical physics relevant to the analysis of learning curves; such an account can be found in Seung et al. (1992), Watkin et al. (1993), Hertz et al. (1991), and Landau and Lifshitz (1980).
2.1 General Notation. We consider feedforward networks having p real-valued inputs, denoted by the vector x ∈ R^p (vectors are assumed to be column vectors throughout this paper). Input vectors are assumed to be generated independently at random according to a probability density p(x). The network computes a function f(x, w): R^p × R^q → R that depends on a vector w ∈ R^q of q real-valued parameters. Note that we address networks with real-valued outputs; networks of this type are typically applied to problems such as function interpolation, time-series prediction, and so forth. We consider functions f of a particular form, which will be discussed in the next subsection. In general, networks having binary-valued inputs, outputs, or weights can also be investigated using this formalism; however, we do not address such networks here.

2.2 Radial Basis Function Networks. We focus on RBFNs, which were introduced by Broomhead and Lowe (1988) and have a strong theoretical basis in interpolation theory (see, for example, Powell 1985, 1990). A convenient way in which to write the function computed by a neural network with a single hidden layer of k units is

f(x, w) = \alpha_0 + \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i),

where w is the vector of all variable weights,

w^T = [\, \alpha_0 \;\; \cdots \;\; \alpha_k \;\; \beta_1^T \;\; \cdots \;\; \beta_k^T \,],

and φ(·) is called a basis function. In this article we assume that α_0 = 0. For the case of RBFNs, we use φ(x, β_i) = φ(σ_i, ‖x − u_i‖), where the u_i are called the centers of the basis functions, and ‖·‖ is a norm that in this article is always the Euclidean norm. Several forms are available for the function φ; however, we will use the most popular: the gaussian basis function,

\phi(x, \beta_i) = \exp\left[-\frac{1}{\sigma_i^2} \|x - u_i\|^2\right].   (2.1)
The vectors β_i of parameters can be written β_i^T = [\,\sigma_i \;\; u_i^T\,]. For gaussian basis functions, σ_i controls the width of the ith basis function. In applying these networks in their full generality, it is usual to adapt all the available parameters in response to training examples. In this case we have q = k(p + 2) variable parameters in total. However, good results can also be obtained using simplified networks in which only the parameters α_i
are adapted, and hence q = k. In the latter case we deal with networks that compute functions

f(x, \alpha) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \alpha_i \exp\left[-\frac{1}{\sigma_i^2} \|x - u_i\|^2\right],

where the u_i are fixed and distinct, the σ_i are fixed, and α^T = [α_1 α_2 ··· α_k] is the vector of adaptable weights. For our purposes, we assume that when only weights in α are adapted, the centers and the σ_i parameters are chosen without reference to the training examples. For example, centers can be chosen randomly but cannot be chosen using clustering techniques (Moody & Darken 1989) or such that they coincide with a subset of the training examples (Broomhead & Lowe 1988). An important observation can be made at this point. When only the parameters in α are adaptable, the networks of interest operate by mapping input vectors of dimension p to a new space of dimension k, and mapping the resulting vectors to an output using a linear network having parameter vector α. Many results are available for such linear networks (Watkin et al. 1993; Seung et al. 1992); however, such results usually apply to networks having gaussian-distributed inputs, whereas in this case such inputs are clearly non-gaussian, as the outputs of the basis functions of equation 2.1 are strictly positive. Some of the later results presented here can therefore be regarded as an extension of earlier results to certain situations where inputs are not gaussian-distributed.

2.3 Target Rules and the Learning Problem. We study the supervised learning of a rule from a sequence of P noise-free training examples. We let P = ψq where q is the number of variable parameters used by the network, and we assume that training examples are generated using a target rule f_T: R^p → R. The training sequence T_P is generated by drawing P inputs independently at random according to a fixed probability density p(x) and forming T_P = ((x_1, f_T(x_1)), . . . , (x_P, f_T(x_P))). Throughout this article we assume that p(x) is the gaussian density function,

p(x) = \frac{1}{(2\pi)^{p/2}} \exp\left[-\frac{1}{2} x^T x\right].   (2.2)
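For concreteness, the restricted network and this input model can be written out directly. The following minimal Python sketch (array shapes, seeds, and values are illustrative assumptions, not taken from the article) evaluates f(x, α) for fixed, randomly chosen centers and widths:

```python
import numpy as np

def f(x, alpha, U, sigma):
    # f(x, alpha) = (1/sqrt(k)) * sum_i alpha_i * exp(-||x - u_i||^2 / sigma_i^2)
    d2 = np.sum((U - x) ** 2, axis=1)
    return alpha @ np.exp(-d2 / sigma ** 2) / np.sqrt(len(alpha))

rng = np.random.default_rng(0)
p, k = 3, 4
U = rng.standard_normal((k, p))   # fixed centers, chosen without reference to data
sigma = np.ones(k)                # fixed widths
alpha = rng.standard_normal(k)    # the adaptable weights
x = rng.standard_normal(p)        # an input drawn from p(x) of equation 2.2
print(f(x, alpha, U, sigma))
```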
We consider two types of target rule. In the first, f_T has the same form as the RBFN of interest, and the function f_T therefore has the form

f_T(x) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \bar{\alpha}_i \exp\left[-\frac{1}{\bar{\sigma}_i^2} \|x - \bar{u}_i\|^2\right],
where \bar{\alpha}_i, \bar{\sigma}_i, and \bar{u}_i for i = 1, 2, . . . , k are fixed parameters that define f_T. Overbars are used in this manner throughout the article to denote parameters associated with the target rule. Note that if \bar{u}_i = u_i and \bar{\sigma}_i = σ_i for i = 1, 2, . . . , k, then the target rule is realizable; if the stated equalities do not hold, then this may not be the case. The second type of target rule considered is a simple linear rule,

f_T(x) = \frac{1}{\sqrt{p}} \gamma^T x,

where γ ∈ R^p is a fixed vector of parameters.

2.4 Statistical Physics. We begin with a few definitions. The error ε(x, w) of a network on a given input is, as usual, a measure of the distance between its output and the output of the target rule for the corresponding input,

\epsilon(x, w) = \frac{1}{2} [f(x, w) - f_T(x)]^2.
Further, we use the usual measure for the training error, E(w, T_P), defined as

E(w, T_P) = \sum_{i=1}^{P} \epsilon(x_i, w)

for the x_i in the training sequence T_P. We denote by

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx   (2.3)

the expected error of the network for a given weight vector, which is called the generalization error for that weight vector. The calculation of this quantity forms the first, crucial step in obtaining expressions for learning curves using this formalism, and its calculation for the target rules of interest is the subject of Section 3. A central assumption in much of the treatment of learning curves using statistical physics is that the network is trained using a noisy dynamics leading at equilibrium to the Gibbs density (Watkin et al. 1993; Seung et al. 1992; Levin et al. 1990),

p_G(w) = \frac{1}{Z}\, p(w) \exp[-\beta E(w, T_P)],

where β = 1/T; T denotes the temperature; p(w) is a prior density on the weights; and the relevant partition function is

Z = \int p(w) \exp[-\beta E(w, T_P)]\, dw.
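A minimal sketch of such a noisy training dynamics follows, under the standard assumption that a discretized Langevin update leaves the Gibbs density (here with a flat prior) approximately stationary; the step size, toy error function, and iteration counts are illustrative choices, not the article's:

```python
import numpy as np

def langevin_step(w, grad_E, beta, eta, rng):
    # Noisy gradient descent on E(w, T_P); for small eta the chain samples
    # approximately from the Gibbs density proportional to exp(-beta * E(w)).
    noise = rng.standard_normal(w.shape)
    return w - eta * grad_E(w) + np.sqrt(2 * eta / beta) * noise

# Toy quadratic error E(w) = w^2 / 2, so the Gibbs density is gaussian
# with variance 1/beta; the empirical variance should approach that.
rng = np.random.default_rng(0)
w, beta, eta = np.zeros(1), 2.0, 1e-2
samples = []
for t in range(20_000):
    w = langevin_step(w, lambda v: v, beta, eta, rng)
    samples.append(w[0])
print(np.var(samples[5000:]))   # ~1/beta = 0.5 after burn-in
```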
The training dynamics can be considered intuitively to be a form of noisy gradient descent, and the temperature, in this context, denotes the amount of noise present in the training dynamics. It is usual in this type of analysis to use the prior p(w) to impose constraints on the ranges of the weights. We will not use this approach here, but this point is discussed in more detail in Section 4. Averages with respect to the Gibbs density, referred to as thermal averages, are as usual denoted by ⟨·⟩_T, and similarly, quenched averages over P-element training sequences are denoted by ⟨⟨·⟩⟩, such that the quenched average of a random variable r is

⟨⟨r(x_1, . . . , x_P)⟩⟩ = \int r(x_1, . . . , x_P)\, p(x_1) \cdots p(x_P)\, dx_1 \cdots dx_P.   (2.4)

The primary quantity of interest in this article is the expected generalization error,

\epsilon_g(P, T) = ⟨⟨ ⟨\epsilon(w)⟩_T ⟩⟩.

Plots of ε_g(P, T) against P for fixed T are called the learning curves for the network of interest. We will also be interested in the expected training error,

\epsilon_t(P, T) = ⟨⟨ ⟨P^{-1} E(w, T_P)⟩_T ⟩⟩.

It is straightforward to verify that

\epsilon_t(P, T) = P^{-1} \frac{\partial(\beta F)}{\partial \beta},   (2.5)

where

F = -T ⟨⟨\ln Z⟩⟩   (2.6)

is the free energy.

2.5 The High-Temperature Limit. We will use a simplified version of the full theory known as the high-temperature limit (Seung et al. 1992). In the high-T limit, we let T → ∞ and ψ → ∞ while ψβ = constant. In this case the training error E(w, T_P) can be replaced by its average value ⟨⟨E(w, T_P)⟩⟩, and the expected training and generalization errors become equal. Consequently, the partition function takes the form

Z_0 = \int p(w) \exp[-q\psi\beta\, \epsilon(w)]\, dw,   (2.7)

and it is straightforward to show using equations 2.5 and 2.6 that

\epsilon_g(P, T) = -\frac{1}{P} \frac{\partial \ln Z_0}{\partial \beta}.   (2.8)
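Equations 2.7 and 2.8 can be checked numerically on a toy one-parameter model. In the sketch below, the quadratic ε(w), the grid, and all constants are assumptions made purely for illustration; the result anticipates the ε_0 + 1/(2ψβ) behavior derived in Section 4:

```python
import numpy as np

eps0, w0, q, psi = 0.1, 0.3, 1, 50.0
P = psi * q
w = np.linspace(-5.0, 5.0, 20001)
dw = w[1] - w[0]
eps = eps0 + (w - w0) ** 2          # toy generalization error epsilon(w)

def ln_Z0(beta):
    e = -q * psi * beta * eps       # exponent of equation 2.7, flat prior
    return np.log(np.sum(np.exp(e - e.max())) * dw) + e.max()

beta, h = 1.0, 1e-4                 # finite difference for equation 2.8
eps_g = -(ln_Z0(beta + h) - ln_Z0(beta - h)) / (2 * h) / P
print(eps_g, eps0 + 1 / (2 * psi * beta))   # both ~0.11
```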
The high-T limit has been used in the analysis of other types of networks by Schwarze et al. (1992) and Sompolinsky et al. (1990).

2.6 Outline of the Remainder of This Article. The remainder of this article is structured as follows. Note that equations 2.3, 2.7, and 2.8 tell us exactly how to calculate ε_g(P, T), the quantity of interest. The first step is to calculate ε(w) using equation 2.3. This is the aim of Section 3: Section 3.1 calculates ε(w) for the case of nonlinear target rules, and Section 3.2 calculates ε(w) for the case of linear target rules. The partition function Z_0 can then be obtained by calculating the integral given in equation 2.7, and finally ε_g(P, T) can be obtained by differentiating ln Z_0 (equation 2.8). These steps are performed in Section 4.

3 The Generalization Error ε(w)

A central quantity in the analysis of neural networks using this approach, and the first thing that we must calculate, is the generalization error ε(w), defined in equation 2.3. This quantity can be derived for the networks and target rules of interest for cases in which all parameters in the network are variable (u_i and σ_i are not assumed to be fixed for i = 1, 2, . . . , k as described above), assuming that the density p(x) is the gaussian density of equation 2.2. Although later we specialize to the case of more restricted networks, we include the fully general derivation here as it is likely to prove useful in further work on the statistical physics of these networks. In the following sections we use some results for integrals having particular forms. These results are summarized in Appendix A.

3.1 Nonlinear Target Rules. The network of interest computes the function

f(x, w) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i),

where φ is the gaussian basis function of equation 2.1. In this case the target rule is

f_T(x) = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \bar{\alpha}_i \phi(x, \bar{\beta}_i)

for fixed, unknown weights \bar{\alpha}_i, \bar{\beta}_i where i = 1, 2, . . . , k, and we can therefore write the error ε(x, w) of a network with weight vector w^T = [α^T β_1^T ··· β_k^T] and inputs x as

\epsilon(x, w) = \frac{1}{2}\left[f(x, w) - f_T(x)\right]^2 = \frac{1}{2k}\left[\sum_{i=1}^{k} \alpha_i \phi(x, \beta_i) - \sum_{i=1}^{k} \bar{\alpha}_i \phi(x, \bar{\beta}_i)\right]^2.

Note that in the interest of generality, we regard for the time being the vectors β_i as containing variable weights. Introducing the notation \sum_{ij} \equiv \sum_{i=1}^{k}\sum_{j=1}^{k}, we can write

\epsilon(x, w) = \frac{1}{2k}\left[\sum_{ij} \alpha_i \alpha_j\, \phi(x, \beta_i)\phi(x, \beta_j) + \sum_{ij} \bar{\alpha}_i \bar{\alpha}_j\, \phi(x, \bar{\beta}_i)\phi(x, \bar{\beta}_j) - 2\sum_{ij} \alpha_i \bar{\alpha}_j\, \phi(x, \beta_i)\phi(x, \bar{\beta}_j)\right]

and, hence,

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j \int \phi(x, \beta_i)\phi(x, \beta_j)\, p(x)\, dx + \frac{1}{2k}\sum_{ij} \bar{\alpha}_i \bar{\alpha}_j \int \phi(x, \bar{\beta}_i)\phi(x, \bar{\beta}_j)\, p(x)\, dx - \frac{1}{k}\sum_{ij} \alpha_i \bar{\alpha}_j \int \phi(x, \beta_i)\phi(x, \bar{\beta}_j)\, p(x)\, dx,   (3.1)
and we therefore need to evaluate integrals of the form

I(a, b) = \int \phi(x, a)\phi(x, b)\, p(x)\, dx,   (3.2)

for a, b ∈ R^{p+1}. We assume that the gaussian basis function,

\phi(x, a) = \exp\left[-\frac{1}{\sigma_a^2} \|x - u_a\|^2\right] = \exp\left[-\frac{1}{\sigma_a^2} (x - u_a)^T (x - u_a)\right],   (3.3)

is used, where u_a is the center of the basis function and σ_a controls its width. The parameter vectors a and b are written as a^T = [σ_a u_a^T] and b^T = [σ_b u_b^T].
Inserting equations 2.2 and 3.3 in equation 3.2, we obtain

I(a, b) = \frac{1}{(2\pi)^{p/2}} \int \exp\left[-\frac{1}{2}\left(\frac{2}{\sigma_a^2}(x - u_a)^T(x - u_a) + \frac{2}{\sigma_b^2}(x - u_b)^T(x - u_b) + x^T x\right)\right] dx
= \frac{1}{(2\pi)^{p/2}} \int \exp\left[-\frac{1}{2}\left(x^T x\left(\frac{2}{\sigma_a^2} + \frac{2}{\sigma_b^2} + 1\right) + x^T\left(-\frac{4}{\sigma_a^2} u_a - \frac{4}{\sigma_b^2} u_b\right) + \frac{2}{\sigma_a^2} u_a^T u_a + \frac{2}{\sigma_b^2} u_b^T u_b\right)\right] dx.

We now apply the identity stated in equation A.1 in order to obtain

I(a, b) = \frac{1}{\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)^{p/2}} \exp\left[-\frac{1}{2}\left(2\left(\frac{1}{\sigma_a^2} - \frac{2}{\sigma_a^4\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)}\right) u_a^T u_a + 2\left(\frac{1}{\sigma_b^2} - \frac{2}{\sigma_b^4\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)}\right) u_b^T u_b - \frac{8}{\sigma_a^2 \sigma_b^2\left(1 + (2/\sigma_a^2) + (2/\sigma_b^2)\right)}\, u_a^T u_b\right)\right].   (3.4)
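As a sanity check on equation 3.4, the closed form can be compared against a Monte Carlo estimate of equation 3.2 over x drawn from the density of equation 2.2; the dimensions, parameter values, and sample size below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
ua, ub = rng.standard_normal(p), rng.standard_normal(p)
sa, sb = 1.3, 0.8

def phi(x, u, s):                        # gaussian basis, equation 3.3
    return np.exp(-np.sum((x - u) ** 2, axis=-1) / s ** 2)

x = rng.standard_normal((200_000, p))    # inputs from p(x), equation 2.2
mc = (phi(x, ua, sa) * phi(x, ub, sb)).mean()

S = 1 + 2 / sa**2 + 2 / sb**2            # 1 + 2/sigma_a^2 + 2/sigma_b^2
closed = S ** (-p / 2) * np.exp(-0.5 * (
    (2 / sa**2 - 4 / (sa**4 * S)) * ua @ ua
  + (2 / sb**2 - 4 / (sb**4 * S)) * ub @ ub
  - (8 / (sa**2 * sb**2 * S)) * ua @ ub))
print(mc, closed)                        # agree to Monte Carlo error
```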
Simplifying equations 3.1 and 3.4, we obtain the final result,

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j\, I(u_i, u_j, \sigma_i, \sigma_j) + \frac{1}{2k}\sum_{ij} \bar{\alpha}_i \bar{\alpha}_j\, I(\bar{u}_i, \bar{u}_j, \bar{\sigma}_i, \bar{\sigma}_j) - \frac{1}{k}\sum_{ij} \alpha_i \bar{\alpha}_j\, I(u_i, \bar{u}_j, \sigma_i, \bar{\sigma}_j),

where

I(u_i, u_j, \sigma_i, \sigma_j) = (\sigma')^{-p/2} \exp\left[\frac{4}{\sigma_i^2 \sigma_j^2 \sigma'}\, u_i^T u_j - \left(\frac{1}{\sigma_i^2} - \frac{2}{\sigma_i^4 \sigma'}\right) u_i^T u_i - \left(\frac{1}{\sigma_j^2} - \frac{2}{\sigma_j^4 \sigma'}\right) u_j^T u_j\right]

and

\sigma' = 1 + \frac{2}{\sigma_i^2} + \frac{2}{\sigma_j^2}.
This is a general expression that applies when all the parameters in the network are regarded as variable weights. Consider now the restricted case in which only the parameters in α are considered to be variable weights. Let M be the symmetric matrix having elements M_{ij} = I(u_i, u_j, σ_i, σ_j). Also, let N be the matrix having elements N_{ij} = I(u_i, \bar{u}_j, σ_i, \bar{σ}_j), and let \bar{α}^T = [\bar{α}_1 \bar{α}_2 ··· \bar{α}_k]. Then we can write

\epsilon(\alpha) = \frac{1}{2k} \alpha^T M \alpha - \frac{1}{k} \alpha^T n + \frac{c_1}{k},   (3.5)

where n = N\bar{\alpha} and

c_1 = \frac{1}{2} \sum_{ij} \bar{\alpha}_i \bar{\alpha}_j\, I(\bar{u}_i, \bar{u}_j, \bar{\sigma}_i, \bar{\sigma}_j).

The weight vector α^0 that minimizes ε(α) can now be found by differentiating equation 3.5. We require that

\frac{\partial \epsilon(\alpha)}{\partial \alpha} = \frac{1}{k}\left[M\alpha - n\right] = 0,

and so

\alpha^0 = M^{-1} n,

where we assume that the matrix M has an inverse. Consequently, the minimum value for ε(α) is

\epsilon_0 = \epsilon(\alpha^0) = \frac{1}{k}\left[c_1 - \frac{1}{2} n^T M^{-1} n\right].   (3.6)

3.2 Linear Target Rules. Having obtained expressions for the generalization error of an RBFN learning a similar RBFN, we now address the case of an RBFN learning a linear target rule. In this case, the target rule is

f_T(x) = \frac{1}{\sqrt{p}} \gamma^T x,

where γ is a fixed, unknown vector of parameters. We therefore have

\epsilon(x, w) = \frac{1}{2}\left[\frac{1}{\sqrt{k}} \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i) - \frac{1}{\sqrt{p}} \gamma^T x\right]^2 = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j\, \phi(x, \beta_i)\phi(x, \beta_j) - \frac{1}{(kp)^{1/2}}\, \gamma^T x \sum_{i=1}^{k} \alpha_i \phi(x, \beta_i) + \frac{1}{2p}\, x^T \gamma \gamma^T x.
This leads to

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j \int \phi(x, \beta_i)\phi(x, \beta_j)\, p(x)\, dx + \frac{1}{2p}\int x^T \gamma \gamma^T x\, p(x)\, dx - \frac{1}{(kp)^{1/2}} \sum_{i=1}^{k} \alpha_i \int \gamma^T x\, \phi(x, \beta_i)\, p(x)\, dx.

For gaussian basis functions and gaussian-distributed inputs, the first term in this expression is exactly as derived above. The second term can be evaluated using equation A.4, and is given by

\frac{1}{2p}\int x^T \gamma \gamma^T x\, p(x)\, dx = \frac{1}{2p} \gamma^T \gamma.

The integral in the final term can be evaluated using equation A.2, and is given by

\int \gamma^T x\, \phi(x, \beta_i)\, p(x)\, dx = \frac{2\, \gamma^T u_i}{\left((2/\sigma_i^2) + 1\right)^{p/2} (2 + \sigma_i^2)} \exp\left[\left(\frac{2}{2\sigma_i^2 + \sigma_i^4} - \frac{1}{\sigma_i^2}\right) u_i^T u_i\right].

For linear target rules, we therefore obtain the final result,

\epsilon(w) = \int \epsilon(x, w)\, p(x)\, dx = \frac{1}{2k}\sum_{ij} \alpha_i \alpha_j\, I(u_i, u_j, \sigma_i, \sigma_j) - \frac{1}{(kp)^{1/2}} \sum_{i=1}^{k} \alpha_i\, J(u_i, \sigma_i, \gamma) + \frac{c_2}{p},   (3.7)

where I(u_i, u_j, σ_i, σ_j) is as defined above,

J(u_i, \sigma_i, \gamma) = \frac{2\, \gamma^T u_i}{\left((2/\sigma_i^2) + 1\right)^{p/2} (2 + \sigma_i^2)} \exp\left[\left(\frac{2}{2\sigma_i^2 + \sigma_i^4} - \frac{1}{\sigma_i^2}\right) u_i^T u_i\right],

and

c_2 = \frac{1}{2} \gamma^T \gamma.
Equation 3.7 again applies when all parameters, including centers and spread parameters, are variable. Let o be the vector having elements o_i = J(u_i, σ_i, γ). Then in the restricted case, when only the parameters in α are considered to be variable weights, we can write

\epsilon(\alpha) = \frac{1}{2k} \alpha^T M \alpha - \frac{1}{(kp)^{1/2}} \alpha^T o + \frac{c_2}{p}.

Consequently, we obtain

\frac{\partial \epsilon(\alpha)}{\partial \alpha} = \frac{1}{k} M \alpha - \frac{1}{(kp)^{1/2}}\, o,

such that

\alpha^0 = \frac{k}{(kp)^{1/2}} M^{-1} o

and

\epsilon_0 = \epsilon(\alpha^0) = \frac{1}{p}\left[c_2 - \frac{1}{2}\, o^T M^{-1} o\right].
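Both restricted-case optima reduce to solving a k × k linear system. The sketch below instantiates the nonlinear-target case, α^0 = M^{-1} n and equation 3.6; the helper and variable names are our own, and M is assumed invertible:

```python
import numpy as np

def I_pair(ui, uj, si, sj, p):
    # The integral I(u_i, u_j, sigma_i, sigma_j) in the closed form above.
    S = 1 + 2 / si**2 + 2 / sj**2
    return S ** (-p / 2) * np.exp(
        4 / (si**2 * sj**2 * S) * ui @ uj
        - (1 / si**2 - 2 / (si**4 * S)) * ui @ ui
        - (1 / sj**2 - 2 / (sj**4 * S)) * uj @ uj)

def optimal_weights(U, sig, Ubar, sigbar, abar):
    k, p = U.shape
    gram = lambda A, sa, B, sb: np.array(
        [[I_pair(A[i], B[j], sa[i], sb[j], p) for j in range(k)]
         for i in range(k)])
    M = gram(U, sig, U, sig)                 # M_ij = I(u_i, u_j, sigma_i, sigma_j)
    n = gram(U, sig, Ubar, sigbar) @ abar    # n = N alpha-bar
    alpha0 = np.linalg.solve(M, n)           # alpha^0 = M^{-1} n
    c1 = 0.5 * abar @ gram(Ubar, sigbar, Ubar, sigbar) @ abar
    eps0 = (c1 - 0.5 * n @ alpha0) / k       # equation 3.6
    return alpha0, eps0
```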
4 Learning Curves

In this section we consider the learning curves for RBFNs learning the target rules considered in the previous section in the high-temperature limit. For this limit we have the partition function

Z_0 = \int p(\alpha) \exp[-k\psi\beta\, \epsilon(\alpha)]\, d\alpha,

as discussed above, and the quantity of interest is

\epsilon_g(P, T) = -\frac{1}{P} \frac{\partial \ln Z_0}{\partial \beta}

(equation 2.8). In order to calculate Z_0, we need to specify a prior density p(α) for the weight vectors α. The usual approach here is to use this prior density to constrain the ranges of the weights, and a density is usually specified that forces weight vectors to have a specified length (see, for example, Seung et al. 1992). This is not the approach that is generally applied in practical network design, and accordingly we will use a gaussian prior,

p(\alpha) = (2\pi)^{-k/2} \sigma_\alpha^{-k} \exp\left[-\frac{1}{2\sigma_\alpha^2} \alpha^T \alpha\right].   (4.1)
We begin by considering nonlinear target rules. In this case we have

Z_0 = (2\pi)^{-k/2} \sigma_\alpha^{-k} \int \exp\left[-\psi\beta\left(\frac{1}{2}\alpha^T M \alpha - \alpha^T n + c_1\right) - \frac{1}{2\sigma_\alpha^2}\alpha^T \alpha\right] d\alpha = (2\pi)^{-k/2} \sigma_\alpha^{-k} \int \exp\left[-\frac{1}{2}\left(\alpha^T Q \alpha - 2\psi\beta\, \alpha^T n + 2\psi\beta c_1\right)\right] d\alpha,

where

Q = \psi\beta M + \sigma_\alpha^{-2} I.   (4.2)

Applying the standard integral identity given in equation A.1 and rearranging, we obtain

Z_0 = \sigma_\alpha^{-k} |Q|^{-1/2} \exp\left[-\frac{1}{2}\left(2\psi\beta c_1 - \psi^2\beta^2 n^T Q^{-1} n\right)\right],

where we assume that the matrix Q has an inverse. The learning curve for the high-T limit is now given by

\epsilon_g(P, T) = -\frac{1}{P} \frac{\partial \ln Z_0}{\partial \beta} = -\frac{1}{P}\left[\frac{1}{2}\frac{\partial}{\partial \beta}\left(\psi^2\beta^2 n^T Q^{-1} n\right) - \frac{1}{2}\frac{\partial}{\partial \beta}\ln|Q| - \psi c_1\right].   (4.3)

At this point we need to differentiate expressions containing the inverse and determinant of the matrix Q with respect to β. It is possible to do this for Q as defined in equation 4.2, and the relevant techniques are presented in Appendix B. However, the full expression obtained for ε_g(P, T) is rather difficult to interpret, and a significantly more enlightening expression can be obtained if we consider the case in which σ_α is large (in effect, the case in which we place no constraints on the ranges of the weights), such that

Q \simeq \psi\beta M,

and hence

\epsilon_g(P, T) = -\frac{1}{P}\left[\frac{1}{2}\psi\, n^T M^{-1} n - \psi c_1 - \frac{k}{2\beta}\right] = \frac{1}{k}\left[c_1 - \frac{1}{2} n^T M^{-1} n\right] + \frac{1}{2\psi\beta}.

Comparing this expression with equation 3.6, we obtain the final result,

\epsilon_g(P, T) = \epsilon_0 + \frac{1}{2\psi\beta}.   (4.4)
For the case of an RBFN learning a linear target rule, a calculation exactly analogous to that given for nonlinear target rules leads again to equation 4.4. The learning curve described by equation 4.4 is similar to the learning curve for a standard linear perceptron. To some extent, this might be expected, as when the basis functions are fixed, training corresponds to the training of a linear perceptron in a nonlinearly "extended" space. (We are in fact training a linear perceptron using a set of inputs corresponding to the outputs of the basis functions; see, for example, Holden and Rayner 1995.) However, it is interesting in the light of the following facts: First, the network as a whole is nonlinear. Second, as mentioned in Subsection 2.2, the result can be regarded as an extension of earlier results to certain situations where inputs are not gaussian-distributed. (See also Subsection 5.2 below.)

5 Discussion

5.1 Using the Results in Practice. The initial aim of this work was to extend the use of techniques from statistical physics to networks of a type that is often applied in practice. Until Section 4, our analysis applied to completely general RBFNs for which all available parameters are adapted during training. The results obtained are therefore potentially useful in any further analysis of RBFNs using this formalism. In order to obtain actual learning curves, it was necessary in Section 4 to restrict our area of interest to more limited RBFNs, for which only the parameters of the vector α are adapted during training. It was also necessary to use the high-temperature limit. These issues are now discussed, with an emphasis on the way in which they affect the practical applicability of the results obtained.

5.1.1 The Use of Fixed Basis Functions. It would clearly be desirable to extend these results to the case in which basis function parameters are chosen with reference to the available training examples, as this is in general the case in practice. At present, the need to make this restriction limits to some extent the practical applicability of the results, although we can reasonably expect that they will be applicable to some problems. Note that this restriction is also used in related work in this area (Freeman & Saad 1995) that is discussed below.

5.1.2 The High-Temperature Limit. In using the high-temperature limit, we obtain relatively fast training but at the expense of requiring a large number of training examples. (By increasing ψ, we make the effective temperature T/ψ small. See Sompolinsky et al. 1990 for a more detailed discussion of this issue.) The results obtained are therefore most suited to situations where the number of training examples available is relatively large. Note, however, that in using the high-temperature limit, we are in fact using an approximation, rather than an exact analysis, that can fail in some
circumstances. In particular, it should be emphasized that the main results of this article require that T be large and that ψ, and hence the number of training examples, be large. (See also the comments on experimental work below.) The high-temperature limit applies when T → ∞, ψ → ∞, and ψβ is constant, and the question of how the results of Section 4 should be interpreted is therefore of interest. In Sompolinsky et al. (1990) and Schwarze et al. (1992), the high-temperature limit is used to analyze perceptrons and committee machines, and expected generalization error is plotted against the value ψβ. Experiments that are performed show that reasonable agreement with the theoretical results for the high-temperature limit can be obtained using a fixed temperature of T = 5. We can therefore consider a fixed, high T and rewrite equation 4.4 as

\epsilon_g(P, T) = \epsilon_0 + \frac{Tq}{2P}.   (5.1)
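Equation 5.1 is straightforward to evaluate; the sketch below, with purely illustrative values of T, q, and ε_0, shows the 1/P decay toward ε_0:

```python
def eps_g(P, T=5.0, q=20, eps0=0.05):
    # High-temperature learning curve, equation 5.1 (illustrative values).
    return eps0 + T * q / (2.0 * P)

for P in (100, 1000, 10000):
    print(P, eps_g(P))   # decays toward eps0 as P grows
```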
Equation 5.1 now shows us the manner in which ε_g(P, T) decreases, approaching its minimum value ε_0, as the number P of training examples is increased.¹

¹ Clearly this requires the assumption that taking a sufficiently high value of T is sufficient to allow practical results in reasonable agreement with the theory to be obtained, as was the case in the studies cited. This remains a subject for future work; however, there appears to be no reason to suppose that it is not the case.

5.1.3 The Large σ_α Assumption. In deriving the learning curve in equation 4.4, it was assumed that σ_α is large in order to simplify the form of the matrix Q. This is equivalent to assuming that no constraints are being placed on the ranges of the weights in α and consequently does not appear to affect significantly the practical applicability of our results, as weights may in practice be constrained only by the limits of the hardware on which a network is implemented. Should this assumption prove to be untenable, the more general expression for the learning curve (for general values of σ_α) can be obtained as demonstrated in Appendix B.

5.2 Related Work. Freeman and Saad (1995) have studied RBFNs within a similar framework to that used here, with similar initial aims. The most significant difference between their approach and ours is in the manner in which quenched averages are dealt with. Averages of the form of equation 2.4 can present significant difficulties in that it is not always possible to evaluate the relevant integral. In this article, this problem is dealt with as a consequence of studying the high-temperature limit, which removes the need to deal explicitly with the quenched average. Freeman and Saad, on
the other hand, address the problem by introducing some assumptions regarding the basis functions, along with a further approximation. Neither the assumptions nor the approximation is required in the approach used here. Note also that Freeman and Saad's work shares with ours the assumption that basis functions are fixed, as discussed above. Many other approaches based on techniques from statistical physics have been developed for the analysis of connectionist networks. It may be possible to obtain some of the results presented here more easily, and perhaps with greater generality, using such alternative approaches, in particular the "smooth networks" approach (see, for example, Sompolinsky et al. 1990 and Seung et al. 1992 for information regarding smooth networks and other relevant work).
5.3 Future Research. Areas for possible future research include the extension of our results to fully general RBFNs and to cases in which the training data are noisy, and experimental studies to investigate the high-temperature limit, as suggested above. Experimental studies would also be useful in clarifying the precise circumstances in which the high-temperature limit is useful in practice. A further area for research is the analysis of RBFNs using more sophisticated and general techniques, such as the annealed approximation and replica methods (see Seung et al. 1992).
6 Conclusion

Statistical physics has made significant contributions to the study of learning curves for feedforward neural networks; however, much of the research to date applies to networks that are not used in practice to any significant extent. We have used the formalism provided by statistical physics to investigate the learning curves for gaussian radial basis function networks, which have been used in many practical applications with excellent results. We have derived general expressions for the generalization error of a network with a given set of parameters learning nonlinear and linear, realizable and unrealizable rules in the absence of noise. We then obtained learning curves for a more restricted class of networks trained using a stochastic algorithm in the high-temperature limit. The assumption that basis functions are fixed, which was required in order to obtain our learning curves, limits to some extent the practical applicability of these results, and an important area for further study is the investigation of RBFNs in which the basis functions are not fixed. Three further possible areas for future research are the experimental investigation of the results obtained using the high-temperature limit, the analysis of fully general networks in the high-temperature limit, and the analysis of RBFNs using more sophisticated techniques, such as the annealed approximation and the replica methods.
Appendix A: Useful Integrals

We make extensive use in this article of the standard identity

\int \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right] dx = (2\pi)^{p/2} |X|^{-1/2} \exp\left[-\frac{1}{2}\left(z - \frac{y^T X^{-1} y}{4}\right)\right],   (A.1)

where X is a constant, symmetric matrix; y and z are constants; and p is the dimension of x. A derivation of this result can be found in Godsill (1993). We also make use of the identity

I = \int a^T x \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right] dx = -\frac{1}{2} (2\pi)^{p/2} |X|^{-1/2} \left(a^T X^{-1} y\right) \exp\left[-\frac{1}{2}\left(z - \frac{y^T X^{-1} y}{4}\right)\right],   (A.2)

where X is again a symmetric matrix and a is a further constant vector. This is not, as far as we are aware, a standard result. It can be derived from equation A.1 using Leibnitz's rule (see Kaplan 1981) by noting that

I = \sum_{i=1}^{p} a_i \int x_i \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right] dx

and

\frac{\partial}{\partial y_i}\left\{\exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right]\right\} = -\frac{1}{2}\, x_i \exp\left[-\frac{1}{2}\left(x^T X x + x^T y + z\right)\right].   (A.3)

Finally, we use the identity

\int x^T X x \exp\left[-x^T Y x\right] dx = \frac{\pi^{p/2}}{|Y|^{1/2}}\, \frac{\mathrm{Trace}(Y^{-1} X)}{2},   (A.4)

where Y is constant and nonsingular. This result was derived by Ó Ruanaidh and Fitzgerald (1993).
Sean B. Holden and Mahesan Niranjan
Appendix B: Obtaining the Full Expression for ²g (P, T) Consider again the equation ²g (P, T) = −
· ¸ 1 1 ∂ ³ 2 2 T −1 ´ 1 ∂ ψ β n Q n − ln | Q | −ψc1 , P 2 ∂β 2 ∂β
(B.1)
(equation 4.3), which provides an expression for ²g (P, T) that can be used to obtain the learning curve for the high-T limit for general values of σα . In this case we have, Q = ψβM + σα−2 I (equation 4.2) and the main task is to deal with the two partial differentiations. This can be done using the results, ∂ ln |Q| = (Q−1 )ij ∂(Q)ij
(B.2)
and ∂Q−1 = ∂(Q)ij
∂(Q−1 )11 ∂(Q)ij
.. .
−1
∂(Q )k1 ∂(Q)ij
··· .. .
∂(Q−1 )1k ∂(Q)ij
···
∂(Q )kk ∂(Q)ij
.. .
−1
= −Q−1 1ij Q−1 ,
(B.3)
where for any matrix A, (A)ij denotes the element at the ith row and jth column of A, and 1ij denotes the matrix having all elements zero with the exception that (1ij )ij = 1. (See Rogers 1980 for results of this type. It is assumed that relevant conditions such as the existence of Q−1 etc. are met.) Now, X ∂ ln |Q| ∂(Q)ij ∂ ln |Q| = ∂β ∂(Q)ij ∂β ij X (Q−1 )ij (M)ij . =ψ
(B.4)
ij
The remaining expression can also be differentiated as follows: ∂ h 2 2i ∂ h T −1 i ∂ h 2 2 T −1 i ψ β n Q n = ψ 2β 2 n Q n + nT Q−1 n ψ β ∂β ∂β ∂β X 2 2 ∂ −1 ni nj (Q )ij + 2ψ 2 βnT Q−1 n =ψ β ∂β ij
Average-Case Learning Curves for Radial Basis Function Networks
= ψ 2β 2
X
ni nj
ij
459
X ∂(Q−1 )ij ∂(Q)rs ∂(Q)rs ∂β rs
+2ψ βnT Q−1 n X X 2 2 −1 −1 ni nj (−Q 1rs Q )ij ψ(M)rs =ψ β 2
rs
ij 2
T
−1
+2ψ βn Q n . (B.5) Substituting equations B.4 and B.5 into equation B.1 gives the full expression for ²g (P, T). These techniques also allow ²g (P, T) to be analyzed for general values of σα for the case of linear target rules. Acknowledgments This research was supported by Science and Engineering Research Council research grant GR/H16759. We thank Marc M´ezard and Sebastian Seung for their useful comments on this work, and we also thank the anonymous referees for several helpful suggestions. In particular, we thank a referee for pointing out equations B.2 and B.3 to us. References Amit, Daniel J. (1989). Modeling brain function: The world of attractor neural networks. Cambridge: Cambridge University Press. Anthony, M., & Biggs, N. (1992). Computational learning theory. Cambridge Tracts in Theoretical Computer Science. Cambridge: Cambridge University Press. Anthony, M., & Holden, S. B. (1994). Quantifying generalization in linearly weighted neural networks. Complex Systems, 8, 91–114. Barkai, E., Hansel, D., & Sompolinsky, H. (1992, March). Broken symmetries in multilayered perceptrons. Physical Review A, 45, 4146–4161. Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355. Freeman, J. A. S., & Saad, D. (1995). Learning and generalization in radial basis function networks. Neural Computation, 7, 945–965. Godsill, S. (1993). The restoration of degraded audio signals. Unpublished doctoral dissertation, Cambridge University, Cambridge, U.K. Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Santa Fe Institute Studies in the Sciences of Complexity. AddisonWesley, Reading, MA. Holden, S. B., & Rayner, P. J. W. (1995, March). Generalization and PAC learning: Some new results for the class of generalized single layer networks. IEEE Transactions on Neural Networks, 6(2), 368–380.
Kaplan, W. (1981). Advanced mathematics for engineers. Reading, MA: Addison-Wesley.
Landau, L. D., & Lifshitz, E. M. (1980). Statistical physics: Course of theoretical physics (3rd ed.). New York: Pergamon Press.
Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1994). Lower bounds on the VC dimension of smoothly parameterized function classes. In Proceedings of the 7th Annual Workshop on Computational Learning Theory (pp. 362–367). New York: ACM Press.
Levin, E., Tishby, N., & Solla, S. A. (1990, October). A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78, 1568–1574.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Niranjan, M., & Fallside, F. (1990, July). Neural networks and radial basis functions in classifying static speech patterns. Computer Speech and Language, 4, 275–289. Also available as Cambridge University Engineering Department Report CUED/F-INFENG/TR 22 (1988).
Powell, M. J. D. (1985, October). Radial basis functions for multivariable interpolation: A review (Tech. Rep. DAMTP 1985/NA12). Cambridge: Department of Applied Mathematics and Theoretical Physics, University of Cambridge.
Powell, M. J. D. (1990, December). The theory of radial basis function approximation in 1990 (Tech. Rep. DAMTP 1990/NA11). Cambridge: Department of Applied Mathematics and Theoretical Physics, University of Cambridge.
Ó Ruanaidh, J. J. K., & Fitzgerald, W. J. (1993). The restoration of digital audio recordings using the Gibbs sampler (Tech. Rep. CUED/F-INFENG/TR.153). Cambridge: Engineering Department, Cambridge University.
Renals, S., & Rohwer, R. (1989, June). Phoneme classification experiments using radial basis functions. In Proceedings of the International Joint Conference on Neural Networks, pp. I-461–I-467.
Rogers, G. S. (1980). Matrix derivatives. Lecture Notes in Statistics, Vol. 2. New York: Marcel Dekker.
Schwarze, H., Opper, M., & Kinzel, W. (1992). Generalization in a two-layer neural network. Physical Review A, 46, R6185–R6188.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992, April). Statistical mechanics of learning from examples. Physical Review A, 45, 6056–6091.
Sompolinsky, H., Tishby, N., & Seung, H. S. (1990). Learning from examples in large neural networks. Physical Review Letters, 65, 1683–1686.
Watkin, T. L. H., Rau, A., & Biehl, M. (1993, April). The statistical mechanics of learning a rule. Rev. Mod. Phys., 65, 499–556.
Communicated by Scott Fahlman
A Sequential Learning Scheme for Function Approximation Using Minimal Radial Basis Function Neural Networks

Lu Yingwei, N. Sundararajan, and P. Saratchandran, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
This article presents a sequential learning algorithm for function approximation and time-series prediction using a minimal radial basis function neural network (RBFNN). The algorithm combines the growth criterion of the resource-allocating network (RAN) of Platt (1991) with a pruning strategy based on the relative contribution of each hidden unit to the overall network output. The resulting network leads toward a minimal topology for the RBFNN. The performance of the algorithm is compared with RAN and the enhanced RAN algorithm of Kadirkamanathan and Niranjan (1993) for the following benchmark problems: (1) hearta from the benchmark problems database PROBEN1, (2) Hermite polynomial, and (3) Mackey-Glass chaotic time series. For these problems, the proposed algorithm is shown to realize RBFNNs with far fewer hidden neurons with better or the same accuracy.

1 Introduction

Radial basis function neural networks (RBFNN) are well suited for function approximation and pattern recognition due to their simple topological structure and their ability to reveal how learning proceeds in an explicit manner (Tao 1993). In the classical approach to radial basis function (RBF) network implementation, the basis functions are usually chosen as gaussian and the number of hidden units (i.e., centers and widths of the gaussian functions) is fixed a priori based on the properties of input data. The weights connecting the hidden and output units are estimated by a linear least squares method, as in the least mean square (LMS) method (Moody & Darken 1989; Musavi et al. 1992). The disadvantage with this approach is that it is not suitable for sequential learning and usually results in too many hidden units (Bors & Gabbouj 1994). A significant contribution that overcomes these drawbacks was made by Platt (1991) through the development of an algorithm that adds hidden units to the network based on the novelty of the new data. The algorithm
is suitable for sequential learning and is based on the idea that the number of hidden units should correspond to the complexity of the underlying function as reflected in the observed data. The resulting network, called a resource-allocating network (RAN), starts with no hidden units and grows by allocating new hidden units based on the novelty in the observations that arrive sequentially. If an observation has no novelty, then the existing parameters of the network are adjusted by an LMS algorithm to fit that observation. An interpretation of Platt's RAN from a function space approach was provided by Kadirkamanathan and Niranjan (1993), who also enhanced its performance by adopting an Extended Kalman Filter (EKF) instead of the LMS method for adjusting network parameters. They called the resulting network a Resource-Allocating Network via Extended Kalman Filter (RANEKF) and showed its superior performance to the RAN for the tasks of function approximation and time-series prediction (Kadirkamanathan & Niranjan 1993). One drawback of both RAN and RANEKF is that once a hidden unit is created, it can never be removed. Because of this, both RAN and RANEKF could produce networks in which some hidden units, although active initially, may subsequently end up contributing little to the network output. If such inactive hidden units can be detected and removed as learning progresses, then a more parsimonious network topology can be realized. This has been the motivation for us to combine pruning with the RANEKF scheme. The following function approximation problem clearly illustrates how RANEKF could produce more hidden units than required.

1.1 Function Approximation Problem. In this problem, the RANEKF has to approximate a nonlinear function (Chen et al. 1993), which is a linear combination of six gaussian functions:

y(x) = \exp\left[-\frac{(x_1 - 0.3)^2 + (x_2 - 0.2)^2}{0.01}\right] + \exp\left[-\frac{(x_1 - 0.7)^2 + (x_2 - 0.2)^2}{0.01}\right] + \exp\left[-\frac{(x_1 - 0.1)^2 + (x_2 - 0.5)^2}{0.02}\right] + \exp\left[-\frac{(x_1 - 0.9)^2 + (x_2 - 0.5)^2}{0.02}\right] + \exp\left[-\frac{(x_1 - 0.3)^2 + (x_2 - 0.8)^2}{0.01}\right] + \exp\left[-\frac{(x_1 - 0.7)^2 + (x_2 - 0.8)^2}{0.01}\right].   (1.1)
(1.1)
Theoretically we can say that to approximate this function, the RANEKF network should need only six hidden neurons. To test this, training samples are randomly chosen in the interval (0, 1) and the observations sequentially presented to RANEKF. Figure 1 shows the results of applying the RANEKF algorithm to this problem. From Figure 1a, we can see the number of hidden neurons in the RANEKF network increasing with the number of observations and the network’s finally settling to 15 hidden units instead of the desired 6 hidden units. Thus the RANEKF overfits and so the network realized by this method is not minimal. Figure 1b shows the desired 6 centers (+) and the 15 centers produced by RANEKF (◦).
A Sequential Learning Scheme for Function Approximation
463
Figure 1: Number and positions of hidden unit centers achieved by RANEKF for the static function approximation problem.
464
Lu Yingwei, N. Sundararajan, and P. Saratchandran
Those hidden units in the network, which are far away from the desired centers, will have a significant output value only when the input is near their center. But for approximating the function of equation 1.1, a zero output is required for all inputs that are far from the desired centers. Therefore, the circles in Figure 1b, which are far from the desired centers, end up making little contribution to the network’s outputs and should be removed from the network. We notice that corresponding to the desired centers (0.3,0.8), (0.7,0.8), (0.9,0.5), and (0.3,0.2), there are two overlapping circles. The two hidden units with the same center will provide no more information than one does, so one of them is superfluous and ought to be pruned. Generally, all those hidden units that contribute little to the network output are superfluous. In this article we propose an algorithm that adopts the basic idea of RANEKF and augments it with a pruning strategy to obtain a minimal RBFNN network. The pruning strategy removes those hidden neurons that consistently make little contribution to the network output and together with an additional growth criterion ensures that the transition in the number of hidden neurons is smooth. Although theoretical proofs showing that the resulting RBFNN topology is in some sense minimal is still under investigation, we have referred it as a minimal RAN (M-RAN) based on detailed simulation studies carried out on a number of benchmark problems (Yingwei et al. 1995). For all those problems, the proposed approach produced a network with fewer hidden neurons than both the RAN and the RANEKF with the same or smaller approximation errors. The article is organized as follows. Section 2 describes the the M-RAN algorithm. Sections 2.1 and 2.2 present the basic strategy for growth and pruning. The performance of M-RAN for the static function approximation problem of equation 1.1 is given in Section 2.3. An additional growth criterion based on the output error over a sliding window is introduced in Section 2.4 to smooth the transitions in the growth and pruning. Section 2.5 gives the final algorithm and its performance for the problem of equation 1.1. In Section 3 the performance of M-RAN is compared with RAN and RANEKF for two static function approximation problems (PROBEN1hearta and Hermite polynomial) and the dynamic Mackey-Glass chaotic time-series prediction problem. Finally, conclusions are summarized in Section 4. 2 Proposed M-RAN Algorithm In this section, a description of the M-RAN algorithm is presented. The notations and symbols used here are the same as those in Kadirkamanathan and Niranjan (1993). 2.1 Basic Growth Strategy. The basic growth strategy used in our algorithm is the same as that of RAN and RANEKF. The output of the network
for all three algorithms (RAN, RANEKF, and M-RAN) has the following form:

\[ f(x) = \alpha_0 + \sum_{k=1}^{K} \alpha_k \phi_k(x) , \tag{2.1} \]
where φ_k(x) is the response of the kth hidden neuron to the input x and α_k is the weight connecting the kth hidden unit to the output unit. α_0 is the bias term. Here, K represents the number of hidden neurons in the network. φ_k(x) is a gaussian function given by

\[ \phi_k(x) = \exp\!\left( -\frac{1}{\sigma_k^2} \lVert x - \mu_k \rVert^2 \right), \tag{2.2} \]

where µ_k is the center and σ_k is the width of the gaussian function, and ‖·‖ denotes the Euclidean norm. The learning process of M-RAN involves the allocation of new hidden units as well as the adjustment of network parameters. The network begins with no hidden units. As observations are received, the network grows by using some of them as new hidden units. The following two criteria must be met for an observation (x_n, y_n) to be used to add a new hidden unit to the network:

\[ \lVert x_n - \mu_{nr} \rVert > \epsilon_n , \tag{2.3} \]
\[ e_n = y_n - f(x_n) > e_{\min} , \tag{2.4} \]

where µ_nr is the center (of the hidden unit) closest to x_n, and ε_n and e_min are thresholds to be selected appropriately. When a new hidden unit is added to the network, the parameters associated with it are α_{K+1} = e_n, µ_{K+1} = x_n, and σ_{K+1} = κ‖x_n − µ_nr‖. κ is an overlap factor, which determines the overlap of the responses of the hidden units in the input space. When the observation (x_n, y_n) does not meet the criteria for adding a new hidden unit, the network parameters w = [α_0, α_1, µ_1^T, σ_1, …, α_K, µ_K^T, σ_K]^T are adapted using the EKF as follows:

\[ w_n = w_{n-1} + e_n k_n . \tag{2.5} \]

k_n is the Kalman gain vector given by

\[ k_n = [R_n + a_n^T P_{n-1} a_n]^{-1} P_{n-1} a_n . \tag{2.6} \]
a_n is the gradient vector and has the following form:

\[ a_n = \left[ 1,\; \phi_1(x_n),\; \phi_1(x_n)\,\frac{2\alpha_1}{\sigma_1^2}(x_n - \mu_1)^T,\; \phi_1(x_n)\,\frac{2\alpha_1}{\sigma_1^3}\lVert x_n - \mu_1 \rVert^2,\; \ldots,\; \phi_K(x_n),\; \phi_K(x_n)\,\frac{2\alpha_K}{\sigma_K^2}(x_n - \mu_K)^T,\; \phi_K(x_n)\,\frac{2\alpha_K}{\sigma_K^3}\lVert x_n - \mu_K \rVert^2 \right]^T . \tag{2.7} \]

R_n is the variance of the measurement noise. P_n is the error covariance matrix, which is updated by

\[ P_n = [I - k_n a_n^T]\, P_{n-1} + Q I , \tag{2.8} \]

where Q is a scalar that determines the allowed random step in the direction of the gradient vector.¹ If the number of parameters to be adjusted is N, P_n is an N × N positive definite symmetric matrix. When a new hidden unit is allocated, the dimensionality of P_n increases to

\[ P_n = \begin{pmatrix} P_{n-1} & 0 \\ 0 & P_0 I \end{pmatrix}, \tag{2.9} \]
and the new rows and columns are initialized by P_0, which is an estimate of the uncertainty in the initial values assigned to the parameters. The dimension of the identity matrix I is equal to the number of new parameters introduced by the new hidden unit.

2.2 Pruning Strategy. This section presents a scheme to prune superfluous hidden units in the M-RAN network. To remove the hidden units that make little contribution to the network output, first consider the output o_k of hidden unit k:

\[ o_k = \alpha_k \exp\!\left( -\frac{\lVert x - \mu_k \rVert^2}{\sigma_k^2} \right). \tag{2.10} \]

If α_k or σ_k in equation 2.10 is small, o_k might become small. Alternately, if ‖x − µ_k‖ is large, which means the input is far from the center of this hidden unit, the output becomes small. In order to decide whether a hidden neuron should be removed, the output values o_k of the hidden units are examined continuously. If the output of a hidden unit is less than a threshold for M consecutive inputs, then that hidden unit is removed from the network.²

¹ According to Kadirkamanathan and Niranjan (1993), the rapid convergence of the EKF algorithm may prevent the model from adapting to future data, and QI is introduced to avoid this problem.
Since the use of absolute values can cause inconsistency in pruning, the outputs of hidden units are normalized, and the normalized output values are used in the pruning criterion. This pruning strategy is expressed as follows:

• For every observation (x_n, y_n), compute the outputs o_k^n (k = 1, …, K) of all hidden units, using equation 2.10.
• Find the largest absolute hidden unit output value ‖o_max^n‖, and compute the normalized output values r_k^n = ‖o_k^n‖ / ‖o_max^n‖ (k = 1, …, K) for the hidden units.
• Remove the hidden units for which the normalized output is less than a threshold δ for M consecutive observations.
• Adjust the dimensionality of the EKF to suit the reduced network.

2.3 Performance of the Pruning Strategy for Function Approximation. The pruning strategy is added to the basic strategy of Section 2.1, and the resulting algorithm is tested on the function approximation problem of equation 1.1. Figure 2 shows the results. It can be clearly seen that the proposed pruning strategy helps to reduce the number of hidden neurons to 6 after 4500 samples. It can also be observed that the number of hidden neurons first increases to 13 (0–600 samples), after which pruning begins to take effect and brings the number of hidden units to 7 at around 1000 samples, where it stays until 4500 samples before reaching the correct value of 6 hidden units. This shows that pruning helps to realize a minimal RBFNN, although the transitions in the number of hidden neurons are not always smooth. Also, the minimal network was realized only after 4500 samples, which is rather long. This motivated us to include an extra criterion for growth to smooth the transitions.

2.4 Sliding Window RMS Criterion for Adding Hidden Neurons. Besides the novelty criteria given by equations 2.3 and 2.4, we include a third criterion, which uses the root mean square (RMS) value of the output error over a sliding data window before adding a hidden unit. The RMS value of the network output error at the nth observation, erms_n, is given by

\[ e_{\mathrm{rms}_n} = \sqrt{ \frac{ \sum_{i=n-(M-1)}^{n} e_i^2 }{ M } } , \tag{2.11} \]

² For a detailed analysis of this pruning criterion, refer to Yingwei et al. (1995).
Figure 2: Number and positions of hidden unit centers obtained with pruning strategy on static function approximation problem.
where e_i = y_i − f(x_i). In equation 2.11, for each observation, a sliding data window is employed to include a certain number of output errors based on y_n, y_{n−1}, …, y_{n−(M−1)}. When a new observation arrives, the data in this window are updated by including the latest datum and eliminating the oldest. For example, at the nth observation, the data window includes e_n, e_{n−1}, …, e_{n−(M−1)}; when the (n + 1)th observation arrives, the data window contains e_{n+1}, e_n, …, e_{n−(M−2)}. In this sense, the window slides along with the data obtained on-line; thus we call it a sliding data window. The size of the window is determined by the value of M, that is, the number of samples used to calculate the RMS error. Thus, the extra growth criterion to be satisfied is erms_n > e'_min, where e'_min is a threshold value to be selected.

2.5 The Final M-RAN Algorithm. In this section, the final M-RAN sequential learning algorithm is presented.

Algorithm. For each input x_n, compute:

\[ \phi_k(x_n) = \exp\!\left( -\frac{1}{\sigma_k^2} \lVert x_n - \mu_k \rVert^2 \right), \qquad f(x_n) = \alpha_0 + \sum_{k=1}^{K} \alpha_k \phi_k(x_n), \]
\[ \epsilon_n = \max\{ \epsilon_{\max}\, \gamma^n,\; \epsilon_{\min} \}, \quad \text{where } 0 < \gamma < 1 \text{ is a decay constant,}^3 \]
\[ e_n = y_n - f(x_n), \qquad e_{\mathrm{rms}_n} = \sqrt{ \frac{ \sum_{i=n-(M-1)}^{n} [y_i - f(x_i)]^2 }{ M } } . \]

If e_n > e_min, and ‖x_n − µ_nr‖ > ε_n, and erms_n > e'_min, then allocate a new hidden unit with α_{K+1} = e_n, µ_{K+1} = x_n, and σ_{K+1} = κ‖x_n − µ_nr‖.

Else, update the network parameters with the EKF:

\[ w_n = w_{n-1} + k_n e_n , \qquad k_n = [R_n + a_n^T P_{n-1} a_n]^{-1} P_{n-1} a_n , \qquad P_n = [I - k_n a_n^T]\, P_{n-1} + Q I . \]

Compute the outputs of all hidden units o_k^n (k = 1, …, K). Find the largest absolute hidden unit output value ‖o_max^n‖ and calculate the normalized value for each hidden unit,

\[ r_k^n = \left\lVert \frac{o_k^n}{o_{\max}^n} \right\rVert, \quad (k = 1, \ldots, K) . \]

If r_k^n < δ for M consecutive observations, then prune the kth hidden neuron and reduce the dimensionality of P_n to fit the requirements of the EKF.

³ The value of ε_n is decayed until it reaches ε_min.
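To make the flow of the algorithm concrete, the following is a minimal Python sketch of one M-RAN observation step under the equations above, specialized to a scalar input and output for brevity. The class layout, variable names, and edge-case choices (e.g., the width assigned to the very first unit) are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

class MRAN:
    """Sketch of sequential M-RAN learning (scalar input/output for brevity)."""

    def __init__(self, eps_max=2.0, eps_min=0.2, gamma=0.977, e_min=0.02,
                 kappa=0.87, p0=1.0, r_n=1.0, q=0.02, delta=0.01,
                 M=25, e_rms_min=0.3):
        self.alpha0 = 0.0                       # bias term alpha_0
        self.alpha, self.mu, self.sigma = [], [], []
        self.P = np.array([[p0]])               # EKF covariance (bias only at start)
        self.prune_count = []                   # consecutive low-output counts
        self.errors, self.n = [], 0
        (self.eps_max, self.eps_min, self.gamma, self.e_min, self.kappa,
         self.p0, self.r_n, self.q, self.delta, self.M, self.e_rms_min) = (
            eps_max, eps_min, gamma, e_min, kappa, p0, r_n, q, delta, M, e_rms_min)

    def phi(self, x):                           # gaussian responses, equation 2.2
        return np.array([np.exp(-(x - m) ** 2 / s ** 2)
                         for m, s in zip(self.mu, self.sigma)])

    def predict(self, x):                       # network output, equation 2.1
        return self.alpha0 + float(np.dot(self.alpha, self.phi(x)))

    def step(self, x, y):
        self.n += 1
        e = y - self.predict(x)
        self.errors = (self.errors + [e])[-self.M:]
        e_rms = np.sqrt(np.mean(np.square(self.errors)))   # equation 2.11
        eps_n = max(self.eps_max * self.gamma ** self.n, self.eps_min)
        d_nr = min((abs(x - m) for m in self.mu), default=np.inf)

        if e > self.e_min and d_nr > eps_n and e_rms > self.e_rms_min:
            # growth criteria met (eqs. 2.3, 2.4, and the sliding-window RMS test)
            self.alpha.append(e)
            self.mu.append(x)
            self.sigma.append(self.kappa * d_nr if np.isfinite(d_nr)
                              else self.kappa)  # arbitrary width for the first unit
            self.prune_count.append(0)
            old = self.P                        # expand covariance, equation 2.9
            self.P = self.p0 * np.eye(old.shape[0] + 3)
            self.P[:old.shape[0], :old.shape[0]] = old
        else:
            # EKF update, equations 2.5-2.8, with the gradient of equation 2.7
            p = self.phi(x)
            a = [1.0]
            for k in range(len(self.mu)):
                c = 2.0 * self.alpha[k] * p[k]
                a += [p[k], c * (x - self.mu[k]) / self.sigma[k] ** 2,
                      c * (x - self.mu[k]) ** 2 / self.sigma[k] ** 3]
            a = np.array(a)
            k_gain = self.P @ a / (self.r_n + a @ self.P @ a)
            w = np.concatenate(([self.alpha0],
                                np.ravel(list(zip(self.alpha, self.mu, self.sigma)))))
            w = w + e * k_gain
            self.alpha0, rest = w[0], w[1:].reshape(-1, 3)
            self.alpha, self.mu, self.sigma = (list(rest[:, 0]), list(rest[:, 1]),
                                               list(rest[:, 2]))
            I = np.eye(len(w))
            self.P = (I - np.outer(k_gain, a)) @ self.P + self.q * I

        self.prune(x)

    def prune(self, x):
        # remove units whose normalized output stays below delta for M observations
        if not self.mu:
            return
        o = np.abs(np.array(self.alpha) * self.phi(x))     # equation 2.10
        r = o / max(o.max(), 1e-12)
        self.prune_count = [c + 1 if rk < self.delta else 0
                            for c, rk in zip(self.prune_count, r)]
        drop = {k for k, c in enumerate(self.prune_count) if c >= self.M}
        if drop:
            keep_cols = [0] + [1 + 3 * k + j for k in range(len(self.mu))
                               if k not in drop for j in range(3)]
            self.P = self.P[np.ix_(keep_cols, keep_cols)]  # shrink covariance
            for name in ("alpha", "mu", "sigma", "prune_count"):
                setattr(self, name, [v for k, v in enumerate(getattr(self, name))
                                     if k not in drop])
```

Driving this sketch with a stream of (x, y) pairs reproduces the grow/update/prune cycle whose behavior is reported in the sections that follow.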
This final algorithm is tested on the function approximation problem of equation 1.1, and the results are presented in Figure 3. From Figure 3, it can be seen that the 6 hidden units at the desired positions are obtained within 600 samples. Also, the transitions during the build-up and drop-off of the hidden neurons are smooth compared to Figure 2.

3 Performance of the Algorithm on Benchmark Problems

In this section, the sequential learning strategy described in Section 2 is applied to three benchmark problems. The first two are static function approximation problems, and the third is a time-series prediction problem. The first problem is taken from the benchmark set PROBEN1 (Prechelt 1994), set up recently and available from http://www.ipd.ira.uka.de/~prechelt/NIPS bench.html. The problem selected for this study is the hearta data set, which is based on the "heart disease" diagnosis data sets from the University of California, Irvine, machine learning repository. The second problem is to approximate the Hermite polynomial, and the data in this case are randomly sampled consistent with the underlying function and presented sequentially to the network. The third example is on-line prediction of a chaotic time series given by the Mackey-Glass equation, where an underlying function for the data does not exist. For all these problems, the performance of our algorithm is compared with the RANEKF and the RAN.

3.1 Function Approximation Problem: PROBEN1-hearta Benchmark. The hearta data set consists of 920 examples (690 training and 230 test examples), with each example described by 13 input attributes and 1 output attribute. Of the 920 examples, only 299 are complete; the rest have missing values. This makes hearta a difficult function approximation problem. Our networks for RAN, RANEKF, and M-RAN used 13 input units and a single continuous output to represent the five different levels of the output attribute. Following the guidelines given in Prechelt (1994), the missing values were replaced with the mean of the nonmissing values for that attribute. The training data for the networks consist of both the training set and the validation set, and the approximation error is calculated over the test data provided in the data set. The performance of the three algorithms is evaluated on the three different permutations of the hearta data set available in PROBEN1: hearta1, hearta2, and hearta3. The values for the various parameters used in this experiment are as follows: ε_max = 4.0, ε_min = 1.5, γ = 0.99, e_min = 0.1, κ = 0.4, P_0 = R_n = 1.0, Q = 0.00001, and η = 0.01. The threshold value δ = 0.00001, the size of the sliding data window M = 60, and the threshold value of the RMS error in the additional growth criterion e'_min = 0.25. Figure 4a shows the growth pattern for the three networks as they learn sequentially from the training data. Figure 4b shows the approximation error, measured by the squared error percentage (E) defined in Prechelt (1994), over the test data.
Figure 3: Number and positions of hidden unit centers achieved with M-RAN.
Figure 4: Performance of RAN, RANEKF, and M-RAN for hearta1.
Table 1: Performance of Different Networks for hearta1, hearta2, and hearta3.

Benchmark data | Network | Number of hidden neurons | Squared error percentage for testing set | Number of training epochs
hearta1 | M-RAN  | 8  | 4.14 | 1
hearta1 | RANEKF | 13 | 4.21 | 1
hearta1 | RAN    | 13 | 5.20 | 1
hearta1 | BP     | 32 | 4.55 | 47
hearta2 | M-RAN  | 9  | 3.81 | 1
hearta2 | RANEKF | 11 | 3.81 | 1
hearta2 | RAN    | 11 | 5.71 | 1
hearta2 | BP     | 16 | 4.33 | 54
hearta3 | M-RAN  | 7  | 3.80 | 1
hearta3 | RANEKF | 11 | 3.91 | 1
hearta3 | RAN    | 14 | 5.63 | 1
hearta3 | BP     | 32 | 4.89 | 46
E is expressed as follows:

\[ E = 100 \, \frac{o_{\max} - o_{\min}}{N \cdot P} \sum_{p=1}^{P} \sum_{i=1}^{N} (o_{pi} - t_{pi})^2 , \tag{3.1} \]
where o_min and o_max are the minimum and the maximum values of the output coefficients in the problem, N is the number of output nodes of the network, and P is the number of patterns (examples) in the data set considered. o_pi and t_pi are the actual and desired output values at the ith output node for the pth pattern, respectively. Table 1 shows the results from the three algorithms together with the performance of multilayer backpropagation (BP) networks for pivot architectures (Prechelt 1994). It can be seen from the table that our algorithm achieves the same or a lower approximation error than RAN, RANEKF, and BP with fewer hidden units. The values quoted for pivot architectures are from Table 11 of Prechelt (1994).

3.2 Function Approximation Problem: Hermite Polynomial. In this section, results for approximating the Hermite polynomial are presented. The Hermite polynomial is given by

\[ f(x) = 1.1\,(1 - x + 2x^2)\, \exp\!\left( -\frac{x^2}{2} \right). \tag{3.2} \]
A random sampling of the interval [−4, 4] is used to obtain 40 input-output data points for the training set.
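As a concrete illustration, the following Python fragment generates such a training set from equation 3.2; the sample count and interval follow the text, while the random seed is an arbitrary choice of ours.

```python
import numpy as np

def hermite(x):
    """Target function of equation 3.2."""
    return 1.1 * (1.0 - x + 2.0 * x ** 2) * np.exp(-0.5 * x ** 2)

rng = np.random.default_rng(0)          # seed is an arbitrary choice
x_train = rng.uniform(-4.0, 4.0, 40)    # 40 random samples in [-4, 4]
y_train = hermite(x_train)              # noise-free targets, as in Figure 5
```

Presented one pair at a time, these samples drive the sequential grow/update/prune cycle sketched earlier.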
All parameter values are chosen the same as those in Kadirkamanathan and Niranjan (1993), that is, ε_max = 2.0, ε_min = 0.2, γ = 0.977, e_min = 0.02, κ = 0.87, P_0 = R_n = 1.0, Q = 0.02, η = 0.05. The additional parameters required for our algorithm are as follows: the threshold value δ = 0.01, the size of the sliding data window M = 25, and the threshold value of the RMS error in the additional growth criterion e'_min = 0.3. Figure 5 shows the results from the three algorithms. Our algorithm achieves a slightly lower approximation error, measured by the RMS error⁴ calculated over 200 uniformly sampled data, than RANEKF, with only 7 hidden units compared to RANEKF's 13. RAN needed 18 hidden units and achieved an RMS error nearly an order of magnitude larger than both our algorithm and RANEKF. With no noise in the training set, M-RAN needed only about half the number of hidden units of RANEKF. The effect of noise on the performance of the three algorithms was analyzed by adding different levels of gaussian white noise to the training patterns, and here too M-RAN always realized more compact networks with smaller approximation errors (Yingwei et al. 1995). As an example, when the noise variance was 0.07, the number of hidden units was 11 for M-RAN, 40 for RANEKF, and 20 for RAN.

3.3 Prediction of Chaotic Time Series. The chaotic time series used is the Mackey-Glass series, which is governed by the time-delay ordinary differential equation

\[ \frac{ds(t)}{dt} = -b\,s(t) + a\,\frac{s(t-\tau)}{1 + s(t-\tau)^{10}} , \tag{3.3} \]

with a = 0.2, b = 0.1, and τ = 17. Integrating the equation over the time interval [t, t + Δt] by the trapezoidal rule yields

\[ x(t+\Delta t) = \frac{2 - b\,\Delta t}{2 + b\,\Delta t}\, x(t) + \frac{a\,\Delta t}{2 + b\,\Delta t} \left[ \frac{x(t+\Delta t-\tau)}{1 + x^{10}(t+\Delta t-\tau)} + \frac{x(t-\tau)}{1 + x^{10}(t-\tau)} \right]. \tag{3.4} \]

Setting Δt = 1, the time series is generated under the condition x(t − τ) = 0.3 for 0 ≤ t ≤ τ (τ = 17). The initial 4000 data points are discarded to let the initialization transients decay.⁵ The series is predicted ν = 50 sample steps ahead using four past samples: s_{n−ν}, s_{n−ν−6}, s_{n−ν−12}, s_{n−ν−18}.
⁴ The RMS error is used for comparison to be consistent with Kadirkamanathan and Niranjan (1993).
⁵ These conditions are the same as those in Platt (1991).
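The series itself is easy to reproduce; the following is a minimal sketch of the recursion of equation 3.4 with Δt = 1 and the stated initial condition (the burn-in handling and array layout are ours).

```python
import numpy as np

def mackey_glass(n_samples, a=0.2, b=0.1, tau=17, burn_in=4000):
    """Generate the Mackey-Glass series via the trapezoidal recursion (eq. 3.4), dt = 1."""
    x = np.zeros(burn_in + n_samples + tau + 1)
    x[:tau + 1] = 0.3                      # x(t - tau) = 0.3 for 0 <= t <= tau
    g = lambda v: v / (1.0 + v ** 10)
    for t in range(tau, len(x) - 1):
        x[t + 1] = ((2 - b) / (2 + b)) * x[t] \
                 + (a / (2 + b)) * (g(x[t + 1 - tau]) + g(x[t - tau]))
    return x[burn_in:burn_in + n_samples]  # discard initialization transients
```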
Figure 5: Performance of RAN, RANEKF, and M-RAN for Hermite Polynomial without noise.
Hence, the nth input-output data for the network to learn are

\[ x_n = [s_{n-\nu},\, s_{n-\nu-6},\, s_{n-\nu-12},\, s_{n-\nu-18}]^T , \qquad y_n = s_n , \]

whereas the step-ahead predicted value at time n − ν is given by

\[ z_n = f^{(n-\nu)}(x_n) , \tag{3.5} \]

where f^{(n−ν)} is the network at time (n − ν). The ν-step-ahead prediction error is

\[ \epsilon_n = s_n - z_n . \tag{3.6} \]

The exponentially weighted prediction error (WPE) is chosen as the measure of the network's prediction performance. WPE at time n can be computed recursively as

\[ \mathrm{WPE}^2(n) = \lambda\, \mathrm{WPE}^2(n-1) + (1 - \lambda)\, \lVert \epsilon_n \rVert^2 , \tag{3.7} \]
where 0 < λ < 1; here λ = 0.95. The parameter values are selected as follows: ε_max = 0.7, ε_min = 0.07, γ = 0.999, e_min = 0.05, κ = 0.87, P_0 = R_n = 1.0, Q = 0.0002, η = 0.02, δ = 0.0001, M = 90, and e'_min = 0.04. Figure 6c displays the approximation curve of M-RAN during the sample time between 4500 and 5000. The desired output curve is plotted with a dash-dotted line, while the output of the minimal network is plotted with a solid line. The figure shows that the approximation curve of our M-RAN network fits the desired curve in most places. The number of hidden units and the network performance are shown in Figures 6a and 6b, respectively. In this case, the value of WPE is almost the same for all three algorithms, while M-RAN needed only 29 hidden units, compared with 39 hidden units for RANEKF and 81 for RAN.

4 Conclusions

In this article, a sequential learning algorithm for realizing a minimal RBF neural network is developed. The algorithm augments the growth criteria of RANEKF and combines with them a scheme to remove those hidden units that make insignificant contributions to the output of the network. Using benchmark problems covering both static (hearta, Hermite polynomial) and dynamic (chaotic time-series prediction) approximations, the algorithm is shown to produce an RBF neural network with smaller complexity than both RANEKF and RAN, with equal or better approximation accuracy.
Figure 6: Performance of RAN, RANEKF, and M-RAN for the Mackey-Glass time-series prediction problem.
Acknowledgments

We thank the anonymous reviewers for their constructive comments and also for pointing out the PROBEN1 benchmark database.

References

Bors, A. G., & Gabbouj, M. (1994). Minimal topology for a radial basis functions neural network for pattern classification. Digital Signal Processing, 4, 173–188.
Chen, C., Chen, W., & Cheng, F. (1993). Hybrid learning algorithm for Gaussian potential function network. IEE Proceedings-D, 140, 442–448.
Kadirkamanathan, V., & Niranjan, M. (1993). A function estimation approach to sequential learning with neural networks. Neural Computation, 5, 954–975.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Musavi, M., Ahmed, W., Chan, K., Faris, K., & Hummels, D. (1992). On training of radial basis function classifiers. Neural Networks, 5, 595–603.
Platt, J. (1991). A resource allocating network for function interpolation. Neural Computation, 3, 213–225.
Prechelt, L. (1994). Proben1—a set of neural network benchmark problems and benchmarking rules (Tech. Rep. 21/94). Karlsruhe: Fakultät für Informatik, Universität Karlsruhe.
Tao, K. (1993). A closer look at the radial basis function (RBF) networks. In A. L. A. Singh (Ed.), Conference Record of the 27th Asilomar Conference on Signals, Systems and Computers (pp. 401–405). New York: IEEE.
Yingwei, L., Sundararajan, N., & Saratchandran, P. (1995). A sequential learning scheme for function approximation using minimal radial basis function (RBF) neural networks (Tech. Rep. CSP/9505). Singapore: Center for Signal Processing, School of Electrical and Electronic Engineering, Nanyang Technological University.
Received December 11, 1995; accepted June 21, 1996.
ARTICLE
Communicated by Andrew Barto and Steven Nowlan
A Neuro-Mimetic Dynamic Scheduling Algorithm for Control: Analysis and Applications

Harpreet S. Kwatra
Francis J. Doyle III
School of Chemical Engineering, Purdue University, West Lafayette, IN 47907-1283 USA

Ilya A. Rybak
James S. Schwaber
Neural Computation Program, E. I. duPont de Nemours & Co., Wilmington, DE 19880-0328 USA
A simple neuronal network model of the baroreceptor reflex is analyzed. From a control perspective, the analysis suggests a dynamic scheduled control mechanism by which the baroreflex may perform regulation of the blood pressure. The main objectives of this work are to investigate the static and dynamic response characteristics of the single neurons and the network, to analyze the neuromimetic dynamic scheduled control function of the model, and to apply the algorithm to nonlinear process control problems. The dynamic scheduling activity of the network is exploited in two control architectures. Control structure I is drawn directly from the present model of the baroreceptor reflex. An application of this structure for level control in a conical tank is described. Control structure II employs an explicit set point to determine the feedback error. The performance of this control structure is illustrated on a nonlinear continuous stirred tank reactor with van de Vusse kinetics. The two case studies validate the dynamic scheduled control approach for nonlinear process control applications.
1 Introduction

A neuronal network architecture for scheduled control of nonlinear process systems was presented recently by Doyle III et al. (1994). The network was originally employed in a simple model of the mammalian baroreceptor reflex (Doyle III et al. 1997). This neuronal network was observed to exhibit a scheduled control activity. Therefore, it was proposed to apply the network in a dynamic scheduled control framework for applications in nonlinear process control. This article presents new results in two main categories: (1) qualitative and quantitative analyses of the dynamics of both the single
neuron model and the network and (2) case studies of control of a conical tank and a nonlinear continuous stirred tank reactor (CSTR) with van de Vusse kinetics. The baroreceptor vagal reflex is responsible for the regulation of blood pressure and displays a rich range of controlled dynamic behavior. The components of the reflex include pressure sensors in the major blood vessels, a neuronal processor, and several classes of motor actuators of the heart. The pressure signal is encoded by first-order neurons. This information is projected onto second-order neurons in the nucleus tractus solitarius (NTS) (Schwaber 1987), which process this information and modulate actuator signals to the heart to change cardiac rate, contractile timing and pattern, and contractile strength. The closed-loop behavior is fairly intuitive: as pressure rises above "normal," a control signal is generated in the NTS, which lowers the heart rate. The first-order sensory neurons are well described as highly sensitive, rapidly adapting neurons that transduce and encode each pulse of pressure with a train of spikes on the rising phase of pressure. Because of the distribution of pressure thresholds, the dynamic first-order neurons have a distributed sensitivity to the measured variable (blood pressure) and its rate of change. Data in the literature (Donoghue et al. 1982; Czachurski et al. 1988; Bradd et al. 1989; Rogers et al. 1993) suggest that these neurons map the spatial pressure distribution onto the inputs to the second-order neurons and "schedule" their activities during dynamic changes of the pressure. This observation and general principles of sensory system organization suggest that second-order neurons have a "competitive" response to the input from the first-order neurons. It is plausible that a lateral inhibition organization provides both the competition between the second-order neurons and their scheduled response. This competition may be simulated by a simple neuronal network model in which each second-order neuron responds and provides a control signal in a definite static (dependent on the thresholds of the corresponding first-order neurons) and dynamic (dependent on the velocity of pressure increase) range of pressure changes. Such behavior can be exploited in the formulation of scheduling algorithms for use in controller design. Just as competition between second-order neurons leads to a selective dynamic response, this attribute can be utilized to apply the appropriate level of nonlinear feedback compensation for a control system. One particularly attractive feature of this biological algorithm is the graceful transition exhibited between operating regimes: there does not appear to be a discrete set of operating levels but rather a continuum. In addition, the transition is dynamic; it depends not only on the magnitude of the blood pressure but on its rate of change as well. These two characteristics of the algorithm are in direct contrast to traditional scheduling approaches, which utilize lookup tables for a discrete set of operating levels (Åström & Wittenmark 1989).
Figure 1: Simplified architecture of the baroreceptor vagal reflex. The lateral inhibitory function proposed for the inhibitory interneurons is implemented by the first-order neurons themselves for the sake of simplicity. Solid lines represent synaptic excitation; dashed ones signify inhibition. A 9 × 6 network of nine first-order and six second-order neurons is shown.
2 A Model of the Baroreceptor Reflex

An analysis of the experimental results (Rogers et al. 1993; Doyle III et al. 1997) reveals the following dynamic properties of the baroreceptor reflex control structure:

1. The second-order NTS neurons respond to blood pressure changes with bursts of activity that have a characteristic frequency much lower than the frequency of the cardiac cycle.
2. The responses suggest that NTS neurons are inhibited immediately after and perhaps immediately before the bursts.
3. It is reasonable to assume that this bursting activity is the source of regulatory signals that cause the compensatory changes in pressure.
4. It is plausible that each NTS neuron provides this regulation in a defined static and dynamic range of pressure.

These observations, combined with other physiological data and general principles of sensory system organization, support the plausibility of a simple baroreflex model. A diagram of the proposed network model for the closed-loop baroreflex is shown in Figure 1 (note that the focus of this model is restricted to the vagal baroreflex, which affects the cardiac output; the effects of the sympathetic system on the heart and the peripheral resistance in the vascular beds have been ignored). The first-order neurons, which are arranged in increasing order of pressure threshold, receive an excitatory input signal that is proportional to the blood pressure. The second-order neurons
receive synaptic excitation from the first-order neurons, as depicted in Figure 1 by the solid lines. While the synaptic inhibition of the second-order neurons probably occurs via inhibitory neurons, a simplified model that does not use interneurons is employed, and their function is implemented using first-order neurons only. This is achieved by direct synaptic inhibition from the neighboring first-order neurons (i.e., the periphery of the receptive field; Hubel & Wiesel 1962), as depicted in Figure 1 by the dashed lines. A more realistic mechanism could employ inhibitory interneurons and reciprocal inhibition between the second-order neurons.

2.1 Model of a Single Neuron. Following the Hodgkin-Huxley formalism, the nonlinear dynamics of the membrane potential of a neuron can be described by the following differential equation (Doyle III et al. 1997):

\[ c\,\dot{V} = g_0 (V_r - V) + \sum_i g_i (E_i - V) + I , \tag{2.1} \]
where c is the membrane capacitance, V is the membrane potential, V_r is the resting membrane potential, g_0 is the generalized conductance, g_i is the deviation in the conductance of the ith ionic channel, E_i is the reversal potential of the ith ionic channel, and I is the input current. Three types of conductances (g_i) are used in the current model: conductances for excitatory and inhibitory synaptic channels that are opened by action potentials (AP) coming from other neurons, and a conductance for the potassium channel opened by AP generation in the neuron itself. The potassium channel conductance, threshold dynamics, and neuronal output are modeled by additional nonlinear differential equations. A detailed description, along with the parameter values, is available in Doyle III et al. (1997). Although the model represents a considerable simplification of the more extensive multiple channel neuron kinetic models (Schwaber et al. 1993), it describes the behavior of the baroreflex neurons with sufficient accuracy for the purpose of network modeling.

Analysis of Dynamics of Single Neurons. The neuron model above is composed of several complex nonlinear differential equations. It is extremely difficult to interpret the input-output behavior of the model directly from these differential equations. This motivates an analysis of the dynamics of the neuron model in terms of its steady-state locus and dynamic response. Such an analysis validates the model against known experimental results and also facilitates a system-level interpretation of the model. This interpretation is useful when the neuron model and its network are employed in process control architectures. The response of a single first-order neuron model to several types of input was examined by computer simulation.
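Before turning to those simulations, the following Python fragment sketches how a membrane trace can be obtained from equation 2.1 by forward Euler integration. The parameter values and the single lumped synaptic channel are illustrative placeholders of ours; the actual model adds state equations for the potassium conductance, threshold dynamics, and spike output (Doyle III et al. 1997).

```python
import numpy as np

def simulate_membrane(I, dt=0.1, c=1.0, g0=0.05, v_rest=-60.0,
                      g_syn=0.0, e_syn=0.0):
    """Forward-Euler integration of equation 2.1 (illustrative parameters).

    I: array of input current samples; g_syn/e_syn: one lumped synaptic
    channel standing in for the model's several conductances.
    """
    v = np.empty(len(I))
    v[0] = v_rest
    for t in range(len(I) - 1):
        dv = (g0 * (v_rest - v[t]) + g_syn * (e_syn - v[t]) + I[t]) / c
        v[t + 1] = v[t] + dt * dv
    return v

# e.g., a pressure-proportional current step, I = kP * P
trace = simulate_membrane(np.r_[np.zeros(200), 2.0 * np.ones(800)])
```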
Figure 2: (a) Nonlinear steady-state behavior of a first-order neuron (baroreceptor). Spiking frequency is calculated as the inverse of the interspike interval. (b) Dynamic response of a first-order neuron to positive and negative step changes in blood pressure. The upper graph is the trace of the spiking frequency calculated as the inverse of the interspike interval; the lower graph shows the blood pressure signal. (c) Response of a first-order neuron to a naturalistic blood pressure trace.
The first simulation, shown in Figure 2a, demonstrates the well-known nonlinear steady-state behavior of the baroreceptor neurons. Below the threshold blood pressure of 105 mmHg, there is no response; the neuron starts spiking once the blood pressure exceeds the threshold value. Note that physiologically, there are about a hundred baroreceptors per nerve, with pressure thresholds distributed from well below (35 mmHg) to well above (170 mmHg) the range of resting mean arterial pressure (Kirchheim 1976; Kumada et al. 1990). It is this distributed sensitivity to pressure that lends itself to the scheduling response of the neuronal network described in section 2.2. Figure 2b shows the response of the neuron to positive and negative step changes in blood pressure. The dynamic nature of the neuron is evident here, as the sudden rise in blood pressure results in a large increase in the spiking frequency, which then decays to a lower value (adaptation). Correspondingly, a drop in blood pressure causes the neuron to stop spiking altogether and then recover to the steady-state spiking frequency (postexcitatory depression). These characteristics validate the neuron model against published experimental data (Brown et al. 1976; Kirchheim 1976). In a typical electrophysiological experiment on a rat, a 10 mmHg step in blood pressure was observed to raise the baroreceptor discharge from around 35 Hz to 75 Hz, which then adapted to around 50 Hz (Brown 1980). Similarly, the negative step in pressure caused a marked depression in discharge with subsequent recovery to the original steady-state spiking frequency.
Figure 2c depicts the response of the neuron to an input signal that is similar to the physiological blood pressure trace. The pressure signal has a sharp rise, a slower fall, and a time period of around 150 ms (in the range of the cardiac period of a rat; Doyle III et al. 1997). The figure shows that the neuron fires action potentials mostly on the rising phase of the pressure pulse, which is consistent with observed behavior (Kirchheim 1976; Karemaker 1987). This sensitivity to the rate of change of pressure (dP/dt) critically contributes to the dynamic nature of the scheduling observed for the neuronal network described in the next section.

2.2 A Network of First-Order and Second-Order Neurons. The general architecture of a neuronal network with excitation and inhibition was discussed at the beginning of section 2. Specifically, a 4 × 1 neuronal structure, in which four first-order neurons converge onto one second-order neuron, was employed as a basic block for the network. Allowing for overlap between the first-order neurons, six of these blocks were combined to yield the neuronal network shown in Figure 1. Note that in this network, the second-order neurons "share" the inputs from the first-order neurons, so as to provide a narrower pressure sensitivity. As will be seen below, such a network architecture gives rise to a scheduled response from the second-order neurons in specific narrow ranges of the input signal.

Analysis of Dynamics of the Network. Just as for the individual neuron models, the response of the network is nonlinear and complex. In fact, the inhibition-excitation architecture of the network makes its response even more complicated and difficult to predict. Once again, this motivates an analysis of the dynamics of the network before it is employed in a simulation of the baroreceptor reflex.

Static Response of the Network. First, the static response of the network is analyzed. By static, it is meant that the rate of change of the input signal is small. To implement this analysis, the input (blood pressure) is ramped up very slowly (at the rate of 0.6 mmHg/sec), and the action potentials for each of the second-order neurons are recorded. Note that the input signal drives the nine first-order neurons of Figure 1, and the output signals from these first-order neurons in turn drive the six second-order neurons. Figure 3a shows that the static response of the neurons is distributed in pressure space. The second-order neurons of the network fire in specific ranges (receptive fields) of the blood pressure input signal. This aspect of the network enables it to "schedule" the network input signal to the output of the network and may be exploited in algorithms for control. In fact, the shapes of the response graphs resemble one-dimensional radial basis functions (Hunt et al. 1992) and could be employed in a similar manner for modeling and control.

Dynamic Response of the Network. By virtue of the fact that the first-order neurons in the network are intrinsically dynamic (see section 2.1), the outputs of the network exhibit significant dynamic behavior as well. This is demonstrated in Figure 3b, in which the responses of the second-order
Figure 3: (a) Static receptive fields of the six second-order neurons of Figure 1. (b) Dynamic receptive fields of the six second-order neurons of Figure 1, at 10 Hz.
neurons are plotted for the case when the network is driven by a sinusoid of frequency 10 Hz (close to the heart rate of a rat), added to a slowly ramping input signal (at the rate of 0.6 mmHg/sec). This input excites the dynamics of the network at the frequency of the sinusoid. The dynamic responses
appear sparser than the corresponding static ones seen in Figure 3a, indicating decreased overall network activity for inputs with high-frequency content. More interestingly, the figure shows that the response from each of the second-order neurons spreads beyond the static response range. For instance, the static receptive field for the lowermost neuron ranges from 3.0 to 6.0 units of input level, while the dynamic receptive field is spread from 3.0 to 7.2 units. Due to the overlap in the dynamic receptive fields observed in Figure 3b, more than one second-order neuron may be active at any time, depending on the frequency content of the input signal. If the input level is rising, a second-order neuron may fire even when the input value is below its static range of response. This may happen because the frequency content in the input signal might be sufficient to excite its dynamics. Note that while the scheduling functionality of the network is due to its excitation-inhibition architecture (see Figure 1), the sensitivity to the dynamic component of the input is attributable to the dP/dt sensitivity of the first-order neurons. From the point of view of control, this dynamic nature of the network is especially relevant. The scheduling function of the network will have a dynamic component as well. In traditional control theory, gain scheduling is restricted to slowly time-varying scheduling variables, as changes in these are not explicitly taken into account in the control design. The dynamic network above would be expected to have interesting dynamic scheduling properties that could be exploited to optimize control performance in process system applications.

2.3 A Simple Model of the Cardiovascular System. The previous sections described the models of the first- and second-order neurons and the neuronal network. Two more blocks have to be specified to complete the model of the baroreceptor reflex. First, the sum of the outputs from the second-order neurons forms the input for an intermediate subsystem (see Figure 1), which is modeled as a simple linear filter. This dynamic system captures the effect of the neural circuitry that lies between the second-order neurons and the heart. This includes the interneurons in the NTS (Spyer 1994) and the motor neurons in the nucleus ambiguus (Loewy & Spyer 1990). Next, a first-order approximation of the cardiovascular dynamics is described. The blood pressure decays exponentially from its current level to the level P_0 with time constant τ_P. At selected time points, the pressure responds with a jump to the level P_m in response to the pumping action of the heart. One of the driving forces for this subsystem is the disturbance P_i, which represents the effects of an external agent (e.g., a drug infusion). The second input is the feedback signal from the intermediate subsystem. A more detailed description is available in Doyle III et al. (1997). On the afferent side, the input signal to the first-order neurons depends on the pressure P via the amount of stretch in the blood vessels, modeled here by a simple linear relationship, I = k_P P, where k_P is a "tuning" coefficient.
Figure 4: Closed-loop response of a simple network model of the baroreceptor reflex (time in ms).
2.4 Computer Simulation of the Baroreceptor Reflex. The analyses of the dynamics of the single neuron and the network provide a reasonable starting point for a simulation of the simple model of the baroreceptor reflex. For the model described, a network with four first-order neurons and one second-order neuron is employed. In the general case of multiple second-order neurons, the sum of the outputs from the second-order neurons may be used. Figure 4 shows the responses of the four first-order neurons (the top two rows) and one second-order neuron (the lower left trace) to a fluctuating pressure signal (the bottom right trace). The cardiovascular system experiences a positive disturbance at time 500 ms, which causes the mean pressure to rise. Due to the barotopical distribution of thresholds, the first-order neurons respond sequentially to increasing mean blood pressure. Hence, the first-order neuron with the lowest threshold (number I) displays the greatest amount of activity. The middle pair of first-order neurons (numbers II and III) excites the second-order neuron, while the other two first-order neurons (numbers I and IV) inhibit it. When the feedback loop is open (not shown here), the persistent disturbance drives the mean pressure to levels much higher than the ones seen in Figure 4. When the feedback loop is closed, the second-order neuron provides the pressure control. As the pressure enters the sensitive range of the second-order neuron, a signal burst is sent to the intermediate block. This block drives the heart with a negative feedback signal, leading to a temporary decrease in the pressure level.
The persistent external signal drives the pressure up again, and the trend is repeated. The low-frequency bursts exhibited by the second-order neuron are similar to the electrophysiological recordings for NTS neurons (Rogers et al. 1993). While the network behavior of the proposed baroreflex model is a reasonable first approximation of the experimentally recorded neuronal behavior (Doyle III et al. 1997), there is ample scope for refinement. However, the simulation successfully illustrates the dynamic scheduled response of a network of complex neurons and shows how the neurons in this specific network can be employed to model and predict the behavior of the baroreceptor reflex. This network architecture can be exploited in process control algorithms, the subject of discussion in the second half of the article.
3 Applications to Nonlinear Process Control

From a control perspective, an interesting feature of the network model of section 2.2 is that individual second-order neurons are active in a narrow static and dynamic range of pressure changes. While the scheduling activity is a result of the network architecture, its dynamic nature is due to the intrinsic behavior of the neurons themselves. These static and dynamic components of the scheduling activity were analyzed in section 2.2. The scheduling functionality of the neuronal network described by the receptive fields in Figures 3a and 3b is of a discrete nature. While the overlapping of the dynamic receptive fields may provide smoother transitions among the scheduled outputs or channels, the range in which a channel is active is nevertheless discrete (e.g., the lowermost neuron in Figure 3b is active from 3.0 to 7.2 units of input). This discrete nature can be traced back to the first-order neurons (baroreceptors), which are active only above their pressure thresholds. Motivated by this insight, a discrete gain scheduling approach is adopted in the current control design. Note, however, that the baroreceptor reflex employs a large number of baroreceptors and NTS neurons, such that the discrete scheduling would approach a continuous nonlinear scheduling function. Based on this last observation, a dynamic gain scheduling controller of the continuous nonlinear form is being researched (Kwatra & Doyle III 1995). Gain scheduling, which has been in use in the chemical process industry for more than a decade, is a powerful method for compensating for process nonlinearities when some measurable indicator of changing plant dynamics is available (Åström & Wittenmark 1989). It may then be possible to eliminate the influence of varying dynamics by changing the parameters of the regulator as a function of the measured scheduling variable. Typical applications include distillation column control (Shinskey 1977), pH control (Gulaian & Lane 1990), and batch process control (Luyben 1990). Experience has shown gain scheduling to be both effective and easy to use for process
Figure 5: Schematic of control structure I.
systems (Åström & Hägglund 1990). Given its success, any improvements that build on its strengths would be clearly motivated. In classical gain scheduling, the gains (or other controller parameters) are scheduled according to the instantaneous values of the scheduling variable, and no convolution over time is involved. Even if the scheduling variable measures dynamic aspects of the system, variations in it are not explicitly dealt with during control design. For instance, although aircraft velocity (Mach number) is often used as the exogenous scheduling variable for the control of aircraft acceleration (Reichert 1992), the converse is not true; that is, acceleration is not normally employed as a scheduling variable for the control of aircraft velocity. Typically, each controller is obtained for a fixed value of the scheduling variable, independent of past values and time. This feature of the traditional gain scheduling design methodology is directly responsible for its limitation to slow variations in the scheduling parameter (Shamma & Athans 1992). By contrast, the scheduling activity of the network of neurons of the earlier sections has a distinct dynamic component. The dynamic scheduling activity of the network holds promise for exploitation in scheduling algorithms for the control of nonlinear process systems, as shown below. Two control architectures are described. First, a control scheme inspired by the baroreceptor reflex model is used to implement disturbance rejection in a conical tank. The shortcomings of this control scheme are addressed in a second control architecture that uses error feedback and a set point. This control scheme is demonstrated on a nonlinear CSTR with van de Vusse kinetics.

3.1 Control Structure I. The first control structure (see Figure 5) is based on the network control scheme used in the simulation of the baroreceptor reflex (see section 2.4). The cardiovascular system of Figure 1 is replaced
by the nonlinear process being controlled, with the output scaled to match the pressure variable that is fed to the network. The neuronal network is identical to that used in the baroreflex simulation, except that the outputs of the second-order neurons are passed through proportional-integral (PI) controllers before being summed. The PI controllers facilitate additional tuning specific to the nonlinear process being controlled. The intermediate subsystem is the same as before: a linear first-order filter. This control design was tested on a simple nonlinear process system. The system under consideration is a first-order process with a gain that varies nonlinearly as a function of the process output. A physical example of such a system is a level-control problem in a tank with a cross-sectional area that varies with height (e.g., a conical or spherical holding tank).

Case Study: Level Control in a Conical Tank. Holding tanks whose cross-sectional area varies as a function of height are common in the chemical process industry. A conical tank model demonstrates control structure I. The differential equation of a conical tank can be written as

\[ \tau \dot{y} = -y^{-3/2} + k\, y^{-2} u + d , \tag{3.1} \]
where y (the controlled variable) is the level of the fluid in the tank, u (the manipulated variable) is the volumetric feed rate to the tank, d is a disturbance input, and τ and k are constants. This model was obtained from a material balance on the tank, assuming that the discharge coefficient is proportional to √y. Note that the first two terms on the right-hand side of equation 3.1 are nonlinear in y. Regulator control of the conical tank was implemented with control structure I. The gains of the PI controllers were tuned so as to yield identical local responses around several operating points, as is done in input-output linearization techniques (Isidori 1989). The integral action of all the controllers was identical and was chosen such that the disturbance was rejected within 60 sec. The simulation of the control system for a large negative disturbance is shown in Figure 6. The control for the same disturbance using a linear controller obtained by internal model control (IMC) design techniques (Morari & Zafiriou 1989) for the steady-state operating point is also shown. The graph suggests that the dynamic network controller adjusts for the rising fluid level in the tank and prevents the overshoot that occurs for the linear control. This dynamic aspect of the scheduling activity is of interest in scheduling control algorithms.
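A minimal sketch of the plant side of this case study follows; the Euler discretization, parameter values, and the single crude PI loop are illustrative assumptions of ours, not the paper's tuned, network-scheduled design.

```python
def tank_step(y, u, d, dt=0.1, tau=1.0, k=1.0):
    """One Euler step of the conical-tank model, equation 3.1 (illustrative parameters)."""
    ydot = (-y ** -1.5 + k * y ** -2 * u + d) / tau
    return max(y + dt * ydot, 1e-3)        # keep the level positive

# crude single PI loop around one operating level (stand-in for the scheduled bank)
y, integral, y_sp, kp, ki = 1.0, 0.0, 1.0, 2.0, 0.5
for step in range(600):                    # 60 s of simulated time
    d = -0.3 if step > 100 else 0.0        # large negative disturbance
    e = y_sp - y
    integral += e * 0.1
    u = max(kp * e + ki * integral, 0.0)   # feed rate cannot be negative
    y = tank_step(y, u, d)
```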
Figure 6: Regulator response of a conical tank controlled by a bank of PI controllers, which was dynamically scheduled by a 6 × 3 network.
Figure 7: Schematic of control structure II.
Control structure I suffers from several shortcomings. First, although it demonstrated reasonable regulator control (disturbance rejection) in this example, it has no facility to implement servo control (set point tracking), a necessary component of most chemical process control applications, because grade changeovers and set point adjustments are common. Second, the individual controllers in the feedback loop are driven by the outputs of the second-order neurons, as opposed to the usual error feedback signal. The implication is that zero offset is not guaranteed in this control design. A second control structure was developed to overcome these problems. While control structure I was consistent with the neurobiological bases in the baroreceptor reflex model, control structure II was modified to meet the requirements of typical process control problems.

3.2 Control Structure II. Figure 7 shows the block diagram of control structure II. The scheduling network is grouped together in a block labeled "scheduler," and the control elements are placed in the "regulator." For instance, in the level control problem, the regulator block would simply contain the PI controllers, and the scheduler would have the network of
neurons. For the general control problem, the regulator block would contain an array of controllers of an appropriate type, such as PID or IMC controllers, and the scheduler block would have a suitably sized network of first-order and second-order neurons. A second, and perhaps debatable, modification is the introduction of a set point signal in the control loop. This is necessary from the process control point of view, because almost all process control problems involve setting and adjusting the set point to achieve the desired output from the process. From the point of view of neurophysiology, this is less justifiable because there is no known explicit set point for the baroreceptor reflex. However, since it is known that the central nervous system does have signals converging onto the crNTS (cardiorespiratory NTS) and the medulla (Spyer 1994), it may be reasonable to speculate that some type of implicit set point signal is conveyed to the network of second-order and higher-order neurons of the reflex. The auxiliary or scheduling variable from the process is fed back to the scheduler block, which contains the dynamic scheduling network of neurons. In Figure 7, the process output itself serves as the scheduling variable. Within the scheduler block, this signal forms the input to the first-order neurons. The outputs of the second-order neurons are passed on to the regulator block. The values of these outputs or channels vary between zero and around unity: zero denotes that the channel is inactive, and unity indicates that the channel is fully active. This scheduling activity of the network is dynamic, as illustrated in section 2.2. As a result, two neighboring channels may be partially active as the scheduling variable makes the transition from one level to another. In addition, there is sensitivity to the rate of change of the scheduling variable, which makes the scheduling functionality truly dynamic. The feedback error signal is obtained by subtracting the process output from the set point and is then fed into the regulator block along with the scheduler channel outputs. Within the regulator block, each of the channel outputs is simply multiplied with the error signal and fed into the controller corresponding to that channel (see Figure 8). This activates the controllers corresponding to the active scheduler channels, in proportion to their level of activity. Hence, if only one channel is active, the error signal gets multiplied by unity and fed into the controller corresponding to that channel. If two neighboring channels are active with, say, outputs of 0.5 unit each, the error signal gets equally divided between the two corresponding controllers. The outputs of all the controllers are added to give the control input for the process. This feedback control loop is applied to a nonlinear chemical reactor in the following section.
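A minimal sketch of the regulator computation just described, with hypothetical channel activations and simple PI controllers standing in for the paper's controller bank:

```python
class PI:
    """Textbook PI controller for one scheduler channel."""
    def __init__(self, kp, ki, dt):
        self.kp, self.ki, self.dt, self.integral = kp, ki, dt, 0.0

    def update(self, e):
        self.integral += e * self.dt
        return self.kp * e + self.ki * self.integral

def regulator(error, channels, controllers):
    """Split the feedback error across channels and sum the controller outputs.

    channels: second-order neuron activations in [0, 1] (see Figure 8);
    each controller sees the error weighted by its channel's activity.
    """
    return sum(ctrl.update(a * error) for a, ctrl in zip(channels, controllers))

# e.g., two neighboring channels half active during a transition
dt = 0.1
bank = [PI(kp=1.0, ki=0.2, dt=dt) for _ in range(6)]  # illustrative gains
u = regulator(error=0.05, channels=[0, 0.5, 0.5, 0, 0, 0], controllers=bank)
```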
Figure 8: Exploded view of the regulator block in control structure II.
Case Study: Dynamic Scheduled Control of a Nonlinear CSTR. A Reaction with van de Vusse Kinetics. van de Vusse (1964) described a chemical reaction mechanism that has recently been used by several researchers as a benchmark for nonlinear studies (Kantor 1986; Doyle III & Morari 1992; Engell & Klatt 1993). It involves both a consecutive and a parallel reaction that give rise to undesired by-products. The reaction equations can be written as

\[ A \xrightarrow{k_1} B \xrightarrow{k_2} C , \qquad 2A \xrightarrow{k_3} D . \tag{3.2} \]
The control objective, then, is to maximize the rate of production of the desired product (B) and to minimize the rates of production of the by-products (C and D) by manipulating the catalyst and the reaction conditions (Kantor 1986). Engell and Klatt (1993) describe a specific example of the above chemical reaction in which cyclopentenol (B) is obtained from cyclopentadiene (A) by acid-catalyzed electrophilic addition of water in dilute solution. The undesired by-products are dicyclopentadiene (D) and cyclopentanediol (C). The reaction kinetics employ rate expressions for k1, k2, and k3 that obey the Arrhenius equation. The coolant dynamics can be neglected, leading to a third-order nonlinear dynamic system. This gives rise to a single-input, single-output (SISO) system in which the concentration of the desired product B is controlled by manipulating the inflow of A to the reactor. The differential equations describing the system and the particular parameters used in the simulations here are given in Engell and Klatt (1993).
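For orientation, the isothermal core of the van de Vusse mass balances has the following standard form; treating the system as two states (the full benchmark adds an energy balance and Arrhenius rate laws) and the rate-constant values shown are simplifying assumptions of ours, not the parameters of Engell and Klatt (1993).

```python
def vandevusse_rhs(c_a, c_b, dilution, c_a0, k1=50.0, k2=100.0, k3=10.0):
    """Isothermal van de Vusse mass balances (illustrative rate constants, 1/h).

    dilution: feed flow over reactor volume F/V (the manipulated input);
    c_a0: inlet concentration of A (the disturbance in the regulator case).
    """
    dca = dilution * (c_a0 - c_a) - k1 * c_a - k3 * c_a ** 2
    dcb = -dilution * c_b + k1 * c_a - k2 * c_b
    return dca, dcb
```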
Table 1: Transfer Function Gains, Zeros, and Poles.

Transfer function | K | z1 | z2 | p1 | p2 | p3
I   | −2.8711e-05 | −0.0074 | 0.0235 | −0.0357 | −0.0136 + 0.0056i | −0.0136 − 0.0056i
II  | −2.7708e-05 | −0.0059 | 0.0282 | −0.0333 | −0.0122 + 0.0047i | −0.0122 − 0.0047i
III | −2.6467e-05 | −0.0044 | 0.0328 | −0.0304 | −0.0107 + 0.0037i | −0.0107 − 0.0037i
IV  | −2.5000e-05 | −0.0029 | 0.0365 | −0.0267 | −0.0091 + 0.0026i | −0.0091 − 0.0026i
V   | −2.3453e-05 | −0.0014 | 0.0368 | −0.0219 | −0.0072 + 0.0015i | −0.0072 − 0.0015i
VI  | −2.2592e-05 | −0.0002 | 0.0307 | −0.0165 | −0.0053 + 0.0009i | −0.0053 − 0.0009i
IMC Controller Design. The IMC design methodology was chosen to obtain the bank of controllers due to its relative ease of derivation and tuning (Morari & Zafiriou 1989). The CSTR dynamics were linearized around six equally spaced points in the operating region of 0.8 to 1.0 mol/l product concentration CB (Engell & Klatt 1993). The resulting six transfer functions were of the type

\[ G(s) = K\, \frac{(s - z_1)(s - z_2)}{(s - p_1)(s - p_2)(s - p_3)} . \tag{3.3} \]
The zero z_2 is in the right half of the s-plane, while the other zero and the poles are all in the left half-plane (see Table 1). Integral square error (ISE) optimal IMC controllers were derived for the six transfer functions, and first-order filters were appended to them. These are equivalent to classical feedback controllers of the form

\[ C(s) = -\frac{1}{K}\, \frac{(s - p_1)(s - p_2)(s - p_3)}{s\,(s - z_1)\,(\lambda s + \lambda z_2 + 2)} . \tag{3.4} \]
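As an illustration, the following fragment assembles the C(s) of equation 3.4 for channel I of Table 1 as numerator/denominator polynomial coefficients; numpy's poly/polymul helpers and the λ value quoted just below are the only assumptions beyond the table.

```python
import numpy as np

# channel I of Table 1
K = -2.8711e-05
z1, z2 = -0.0074, 0.0235
poles = [-0.0357, -0.0136 + 0.0056j, -0.0136 - 0.0056j]
lam = 20.0  # IMC filter time constant (seconds), as chosen in the text

# C(s) = -(1/K) (s-p1)(s-p2)(s-p3) / [ s (s-z1) (lam*s + lam*z2 + 2) ]
num = -np.real(np.poly(poles)) / K                  # (s-p1)(s-p2)(s-p3) scaled by -1/K
den = np.polymul(np.poly([0.0, z1]),                # s (s - z1)
                 np.array([lam, lam * z2 + 2.0]))   # lam*s + lam*z2 + 2
```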
The six IMC-type controllers were placed in the "regulator" block of the feedback control loop of Figure 7. The IMC filter time constant λ was set to 20 seconds.

Simulation Results. The IMC-type controllers in the regulator block are dynamically scheduled by a 9 × 6 network of nine first-order and six second-order neurons placed in the scheduler block. The simulation is implemented using MATLAB software. Four simulation runs are presented here: three with set point changes in the concentration of B (servo problem) and one with a disturbance input in the inlet concentration of A (regulator problem). The parameters for the CSTR model, the set point changes, and the disturbance input were chosen to be identical to those used by Engell and Klatt (1993) to facilitate a direct comparison. Their approach was based on traditional static gain scheduling. Their simulation results are reproduced here and are characterized by overshoots of up to 60% and response times of about 0.15 to 0.30 hour.
Figure 9: Servo control for a step set point change in CB from 0.85 to 0.95 mol/l; CAO is 5.1 mol/l.
The first dynamic scheduled control simulation using the network of neurons and IMC-type controllers is for a step change in the set point of CB from 0.85 to 0.95 mol/l. Figure 9 shows that CB rises to 0.95 mol/l with a response time of about 0.1 hour and with no significant overshoot. This is distinctly better than the static scheduling approach, which had an overshoot of about 40%. The second simulation is for a set point change in CB from 0.85 to 0.86 mol/l. The response time was around 0.1 hour, with no appreciable overshoot (see Figure 10). This compares favorably with the static scheduling approach, which had a response time of 0.3 hour and around 60% overshoot. In addition, the static scheduling control displayed excessive oscillations. Figure 11 shows the case of a set point change in CB from 0.95 to 0.85 mol/l. It shows an inverse response, which is larger than that of the static scheduling approach; however, it is improved in the sense that it has no overshoot and a smaller response time (see Table 2). Finally, Figure 12 shows the simulation of the regulator problem, in which the control loop attempts to reject the step disturbance in CAO from 4.5 to 5.7 mol/l. The result is better in the dynamic scheduled control case; there is no overshoot again, and the response times are comparable.

Discussion. A characteristic feature of the dynamic scheduled control simulations is that the controlled output trajectories are not smooth. This is a direct outcome of employing neurons in the scheduling network.
Figure 10: Servo control for a step set point change in CB from 0.85 to 0.86 mol/ℓ; CAO is 5.7 mol/ℓ.
Figure 11: Servo control for a step set point change in CB from 0.95 to 0.85 mol/ℓ; CAO is 4.5 mol/ℓ.
Table 2: Summary of Results for the Nonlinear CSTR Case Study.

Magnitude of input    Settling time (h)              Overshoot (%)
                      Static       Dynamic           Static       Dynamic
                      scheduling   scheduling        scheduling   scheduling
0.10 set point        0.16         0.1               40           0
0.01 set point        0.30         0.1               60           0
−0.10 set point       0.15         0.1               3            0
0.12 disturbance      0.15         0.15              —            0
Figure 12: Regulator control for a step disturbance in CAO from 4.5 to 5.7 mol/ℓ; set point CB is 0.9 mol/ℓ.
of the neurons is not smooth but is composed of spikes and intermediate durations of inactivity. Smooth process curves are desirable, and future work will incorporate a low-pass filter to accomplish this. The individual linear controllers employed in the regulator block are themselves guaranteed to be stable due to the IMC methodology used. Around the steady states for which each of them is designed, they provide stable closed-loop control, optimal in the integral square error sense. However, away from the steady states and during transitions, stability and performance would need to be inferred from extensive simulations or robustness analysis.
Overall, the dynamic scheduled control scheme performed better than the traditional static gain scheduled approach employed by Engell and Klatt (1993), especially in terms of reduced process overshoot. Conceptually, this is understandable, because the dynamic scheduler compensates for a controlled variable that is changing quickly and therefore prevents an overshoot. The simulations demonstrate the viability of the dynamic scheduled control scheme for process control problems. The next step would be to apply the algorithm to a complex nonlinear control problem of industrial relevance.

4 Conclusion and Future Work

This article consists of two main parts. The first part describes models of neurons and networks and their application in simulation of the baroreceptor reflex. Dynamics of the single neuron model were analyzed by simulations, which revealed a remarkable similarity of the model responses to the static and dynamic characteristics of the mammalian baroreceptors. A network of first-order and second-order neurons was simulated, and the static and dynamic receptive fields of the second-order neurons were traced. This analysis provided a basis for building a simple network model of the baroreceptor reflex. The proposed neuronal network model exhibits control of blood pressure disturbance in a fashion similar to that observed experimentally (Doyle III et al. 1997). These simulation and analysis results are the first of their kind in modeling efforts of the baroreceptor reflex.

The second part of the article reports on the application of the network model of the baroreflex to nonlinear process control problems. The dynamic scheduling activity of the network is exploited in two control architectures. Control structure I is directly abstracted from the baroreceptor reflex model and is demonstrated on a nonlinear tank level control problem. Control structure II is developed to provide tools that are necessary for meaningful application in process control. This includes the introduction of a set point and use of the feedback error signal. The control structure is illustrated on a nonlinear CSTR with van de Vusse kinetics. The simulation results show better control than traditional gain scheduling, especially in terms of reduced process overshoot. The two case studies validate the usefulness of the dynamic scheduled control approach.

A few comments on the issue of closed-loop stability for the dynamic scheduled control algorithm are relevant at this point. As is the case with classical gain scheduling, only local stability can be easily guaranteed, and global stability is inferred by extensive simulations. In general, global stability can be guaranteed under the condition of sufficiently slow variations in the scheduling variable (Shamma & Athans 1992), but these bounds are difficult to obtain and usually very conservative. In light of the fact that the dynamic scheduled control approach implicitly accounts for fast changes in the scheduling variable, it is plausible that the related bounds on the
scheduling variable rate of change would be less conservative. Also note that for nonlocal operation of the controllers, their net contribution is a weighted (between 0 and 1) sum of the individual control actions. Thus, the weights themselves will not lead to unbounded action and the associated instability. A rigorous proof of stability during these transitions shall be obtained with potentially conservative conic sector bounding (Doyle III et al. 1989) in future work. The formulation of such bounds will be facilitated by the structured form of the dynamic scheduling controller, which is a weighted contribution of several linear controllers. The nonlinearity arises from the dynamic nature of the weighting function. Since each weighting function w_i is bounded (between 0 and 1), it can be cone bounded using a time-varying gain Δ_i as:

\dot{x}_i = A_{0i} x_i + \Delta_i A_{1i} x_i + b_i u_i
y_i = c_i x_i.   (4.1)
If the process being controlled can be similarly conic sector bounded using a time-varying gain, say, Δ_p, then it would be possible to combine all the time-varying gains in the closed loop in the standard M–Δ structured uncertainty form (Doyle 1982). Using extensions for time-varying and nonlinear uncertainty (Doyle III et al. 1989), conservative stability conditions could be derived.

The success of the dynamic gain scheduling algorithm is primarily due to the fact that while the traditional scheduled control approaches are built on the slowly varying scheduling variable assumption, the current approach employs an explicitly dynamic scheduling mapping. In the neuronal network implementation of the dynamic scheduled control algorithm, the dynamic scheduling is realized by the dynamic nature of the neurons, especially the dP/dt sensitivity of the first-order neurons. Note that although neural net–based scheduling schemes (such as a radial basis function neural net) would lead to a simpler controller by avoiding the complex neuron dynamics, these would typically implement a static scheduling controller. Although dynamics could be introduced in the net by using feedback, the nodes are nevertheless static operators. The neuronal network used in this article employs dynamic neuronal blocks or nodes and is therefore inherently dynamic. On the other hand, not all of the complex neuron dynamics may contribute to the performance of the network controller. For the actual control of nonlinear process systems, the use of such neuronal networks may be unrealistic. Therefore, the underlying principles of the dynamic scheduled control exhibited by the above neuronal network need to be extracted and implemented in a nonlinear control framework (Kwatra & Doyle III 1995). Through continued collaborative research activity, future work will seek to (1) increase the level of complexity in the neuron models to include multiple channel kinetics, (2) refine the system level models such as the model of the cardiovascular system, (3) analyze robustness properties of the pro-
posed dynamic scheduled control architectures, and (4) extract the underlying principles of the dynamic scheduled control exhibited by the neuronal network and apply them to nonlinear control theory.

Acknowledgments

The authors gratefully acknowledge support from the National Science Foundation (grant BCS-9315738).

References

Åström, K. J., & Hägglund, T. (1990). Practical experiences of adaptive techniques. Proc. of American Control Conference, San Diego (pp. 1599–1606).
Åström, K. J., & Wittenmark, B. (1989). Adaptive control (2nd ed.). Reading, MA: Addison-Wesley.
Bradd, J., Dubin, J., Due, B., Miselis, R. R., Monitor, S., Rogers, W. T., Spyer, K. M., & Schwaber, J. S. (1989). Mapping of carotid sinus inputs and vagal cardiac outputs in the rat. Soc. Neurosci. Abstr., 15, 593.
Brown, A. M. (1980). Receptors under pressure: An update on baroreceptors. Circulation Research, 46(1), 1–10.
Brown, A., Saum, W., & Tuley, F. (1976). A comparison of aortic baroreceptor discharge in normotensive and spontaneously hypertensive rats. Circulation Research, 39, 488–496.
Czachurski, J., Dembowsky, K., Seller, H., Nobiling, R., & Taugner, R. (1988). Morphology of electrophysiologically identified baroreceptor afferents and second order neurons in the brainstem of the cat. Arch. Ital. Biol., 126, 129–144.
Donoghue, S., Garcia, M., Jordan, D., & Spyer, K. (1982). Identification and brainstem projections of aortic baroreceptor afferent neurons in nodose ganglia of cats and rabbits. J. Physiol. Lond., 322, 337–352.
Doyle III, F. J., & Morari, M. (1992). A survey of approximate linearization approaches for chemical reactor control. DYCORD+ 92, IFAC, pp. 207–212.
Doyle III, F. J., Henson, M. A., Ogunnaike, B. A., Schwaber, J. S., & Rybak, I. (1997). Neuronal modeling of the baroreceptor reflex with applications in process modeling and control. In D. Elliott (Ed.), Neural networks for control. New York: Springer-Verlag.
Doyle III, F. J., Kwatra, H., Rybak, I., & Schwaber, J. S. (1994). A biologically-motivated dynamic nonlinear scheduling algorithm for control. Proc. of the American Control Conference, Baltimore (pp. 92–96).
Doyle III, F., Packard, A., & Morari, M. (1989). Robust controller design for a nonlinear CSTR. Chem. Eng. Sci., 44, 1929–1947.
Doyle, J. C. (1982). Analysis of feedback systems with structured uncertainties. IEE Proc. D, 129(6), 242–250.
Engell, S., & Klatt, K. (1993). Nonlinear control of a non-minimum-phase CSTR. Proc. of the American Control Conference, San Francisco (pp. 2941–2945).
Gulaian, M., & Lane, J. (1990). Titration curve estimation for adaptive pH control. Proc. of American Control Conference, San Diego (p. 1414).
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Hunt, K., Sbarbaro, D., Zbikowski, R., & Gawthrop, P. (1992). Neural networks for control systems—a survey. Automatica, 28(6), 1083–1112.
Isidori, A. (1989). Nonlinear control systems—An introduction (2nd ed.). New York: Springer-Verlag.
Kantor, J. C. (1986). Stability of state feedback transformations for nonlinear systems—some practical considerations. Proc. of the American Control Conference, Seattle (pp. 1014–1016).
Karemaker, J. M. (1987). Neurophysiology of the baroreceptor reflex. In R. I. Kitney & O. Rompelman (Eds.), The beat-by-beat investigation of cardiovascular function (pp. 27–49). Oxford: Clarendon Press.
Kirchheim, H. (1976). Systemic arterial baroreceptor reflexes. Physiological Reviews, 56(1), 100–176.
Kumada, M., Terui, N., & Kuwaki, T. (1990). Arterial baroreceptor reflex: Its central and peripheral neural mechanisms. Prog. Neurobiol., 35, 331–361.
Kwatra, H. S., & Doyle III, F. J. (1995). Dynamic gain scheduled control. AIChE Annual Meeting, Miami Beach. Manuscript submitted.
Loewy, A., & Spyer, K. (1990). Vagal preganglionic neurons. In A. D. Loewy & K. M. Spyer (Eds.), Central regulation of autonomic functions (pp. 68–87). New York: Oxford University Press.
Luyben, W. L. (1990). Process modeling, simulation, and control for chemical engineers. New York: McGraw-Hill.
Morari, M., & Zafiriou, E. (1989). Robust process control. Englewood Cliffs, NJ: Prentice-Hall.
Reichert, R. (1992). Dynamic scheduling of modern-robust-control autopilot designs for missiles. IEEE Control Systems Magazine, 12, 35–42.
Rogers, R. F., Paton, J. F. R., & Schwaber, J. S. (1993). NTS neuronal responses to arterial pressure and pressure changes in the rat. Am. J. Physiol., 265, R1355–R1368.
Schwaber, J. (1987). Neuroanatomical substrates of cardiovascular and emotional-autonomic regulation. In A. Magro, W. Osswald, D. Reis, & P. Vanhoutte (Eds.), Central and peripheral mechanisms in cardiovascular regulation (pp. 353–384). New York: Plenum Press.
Schwaber, J., Graves, E., & Paton, J. (1993). Computational modeling of neuronal dynamics for systems analysis: Application to neurons of the cardiorespiratory NTS in the rat. Brain Research, 604, 126–141.
Shamma, J. S., & Athans, M. (1992). Gain scheduling: Potential hazards and possible remedies. IEEE Control Systems Magazine, 12, 101–107.
Shinskey, F. G. (1977). Distillation control: For productivity and energy conservation. New York: McGraw-Hill.
Spyer, K. (1994). Central nervous mechanisms contributing to cardiovascular control. J. Physiol., 474.1, 1–19.
van de Vusse, J. (1964). Plug-flow type reactor versus tank reactor. Chem. Eng. Sci., 19, 994–997.
Received August 7, 1995; accepted May 28, 1996.
NOTE
Communicated by Michael Hines
Conductance-Based Integrate-and-Fire Models

Alain Destexhe
Department of Physiology, Laval University School of Medicine, Québec, G1K 7P4, Canada
A conductance-based model of Na+ and K+ currents underlying action potential generation is introduced by simplifying the quantitative model of Hodgkin and Huxley (HH). If the time course of the rate constants can be approximated by a pulse, the HH equations can be solved analytically. Pulse-based (PB) models generate action potentials very similar to those of the HH model but are computationally faster. Unlike the classical integrate-and-fire (IAF) approach, they take into account the changes of conductances during and after the spike, which have a determinant influence in shaping neuronal responses. Similarities and differences among PB, IAF, and HH models are illustrated for three cases: high-frequency repetitive firing, spike timing following random synaptic inputs, and network behavior in the presence of intrinsic currents.
1 Introduction

Over 40 years ago, Hodgkin and Huxley introduced a remarkably successful quantitative description of action potentials (Hodgkin & Huxley, 1952). More recent techniques revealed that transmembrane ionophores underlie ionic currents and that the assumptions of the Hodgkin-Huxley (HH) model were essentially correct if reinterpreted in the light of current knowledge on the activation of ion channels (see Hille, 1992). Although Markov kinetic models describe the properties of sodium channels more adequately (Vandenberg & Bezanilla, 1991), the success of the HH model is due to its "observability," as its parameters are easily evaluated from voltage-clamp experiments. It remains the most widely used description of voltage-dependent conductances.

The drawback of using HH models is that they require solving a set of several first-order differential equations in addition to cable equations. Moreover, numerical integration of these equations can require relatively small integration steps due to the nonlinearity and fast kinetics of the Na+ and K+ currents underlying action potentials. This fine integration step is then imposed on the whole system, which may have slower currents that do not require such a high temporal resolution, and may result in considerable computation time for simulating large-scale neuronal networks.
An alternative approach was to simplify the process of action potential generation by representing the neuron as an integrate-and-fire (IAF) device (see Arbib, 1995). In this case, the neuron produces a spike when its membrane potential crosses a given threshold value, and the membrane is instantaneously reset to its resting level. Although this type of model has been extensively used for representing large-scale neuronal networks, an important drawback is that it neglects the variations of Na+ and K+ conductances, which may have important effects in spike generation (Mainen, 1995). Moreover, central neurons have been shown to contain a large spectrum of other intrinsic currents, which have a determinant effect on firing behavior (Llinás, 1988). In this case, the IAF paradigm is clearly inappropriate.

In this article, we propose a simplified model of action potentials based on HH equations. Besides its similarity with IAF models, this method is shown to capture the conductance changes due to action potentials, which have a determinant influence on electrophysiological behavior.

2 Pulse-Based Models for Voltage-Dependent Currents

We start from the equations of Hodgkin and Huxley (1952):

C_m \dot{V} = -g_{leak} (V - E_{leak}) - \bar{g}_{Na}\, m^3 h\, (V - E_{Na}) - \bar{g}_{Kd}\, n^4 (V - E_K)
\dot{m} = \alpha_m(V)\,(1 - m) - \beta_m(V)\, m
\dot{h} = \alpha_h(V)\,(1 - h) - \beta_h(V)\, h
\dot{n} = \alpha_n(V)\,(1 - n) - \beta_n(V)\, n   (2.1)
where V is the membrane potential, C_m = 1 µF/cm² is the membrane capacitance, and g_leak = 0.1 mS/cm² and E_leak = −70 mV are the leak conductance and reversal potential, respectively. \bar{g}_{Na} = 100 mS/cm² and \bar{g}_{Kd} = 80 mS/cm² are the maximal conductances of the sodium current and delayed rectifier, with reversal potentials of E_Na = 50 mV and E_K = −90 mV. m, h, and n are the activation variables, whose time evolution depends on the voltage-dependent rate constants α_m, β_m, α_h, β_h, α_n, and β_n. The voltage-dependent expressions of the rate constants were as in the version described by Traub and Miles (1991).

The behavior of the HH model in a single compartment cell during a train of action potentials is shown in Figure 1 (left panel). As a consequence of the rapid changes in membrane voltage during action potentials, the time course of the rate constants exhibits sharp transitions. This is best illustrated by the behavior of α_m: the function α_m(V) is approximately zero for the whole range of membrane potentials below ∼ −50 mV, then increases progressively with depolarization; consequently, before and after the action potential, α_m ∼ 0, and α_m takes positive values only during the spike (see Fig. 1).
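The pulse-like time course of the rate constants is easy to reproduce numerically. The sketch below integrates equation 2.1 with forward Euler and tracks α_m(V) along the trajectory. Note the hedges: it uses the classic Hodgkin-Huxley (1952) rate expressions (in the modern membrane-potential convention) rather than the Traub-Miles variant used in the article, and the injected current is an arbitrary choice.

```python
import numpy as np

# Forward-Euler integration of equation 2.1. The rate functions are the
# classic Hodgkin-Huxley (1952) expressions, not the Traub-Miles variant
# used in the article; passive parameters follow the text.

def vtrap(x, y):
    # avoids the 0/0 singularity of x / (1 - exp(-x/y)) at x = 0
    return y if abs(x) < 1e-9 else x / (1.0 - np.exp(-x / y))

def alpha_m(V): return 0.1 * vtrap(V + 40.0, 10.0)
def beta_m(V):  return 4.0 * np.exp(-(V + 65.0) / 18.0)
def alpha_h(V): return 0.07 * np.exp(-(V + 65.0) / 20.0)
def beta_h(V):  return 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
def alpha_n(V): return 0.01 * vtrap(V + 55.0, 10.0)
def beta_n(V):  return 0.125 * np.exp(-(V + 65.0) / 80.0)

Cm, g_leak, E_leak = 1.0, 0.1, -70.0        # uF/cm^2, mS/cm^2, mV
gNa, ENa, gKd, EK = 100.0, 50.0, 80.0, -90.0
dt, I_inj = 0.01, 10.0                      # ms, uA/cm^2 (arbitrary drive)
V, m, h, n = -70.0, 0.05, 0.6, 0.3
for step in range(int(50.0 / dt)):          # 50 ms of activity
    I_ion = (g_leak * (V - E_leak) + gNa * m**3 * h * (V - ENa)
             + gKd * n**4 * (V - EK))
    V += dt * (I_inj - I_ion) / Cm
    m += dt * (alpha_m(V) * (1.0 - m) - beta_m(V) * m)
    h += dt * (alpha_h(V) * (1.0 - h) - beta_h(V) * h)
    n += dt * (alpha_n(V) * (1.0 - n) - beta_n(V) * n)
    # alpha_m(V) stays near zero below about -50 mV and pulses during spikes
```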
Figure 1: Comparison of action potential generation in Hodgkin-Huxley and pulse-based models. A train of spikes was generated by injecting a depolarizing current pulse (2 nA) into a single compartment model (area of 15,000 µm²; \bar{g}_{Kd} = 30 mS/cm²; other parameters as in equation 2.1). The time course of the rate constants (α_m, β_m, α_h, β_h, α_n, and β_n) and state variables (m, h, and n) is shown for each panel. In the Hodgkin-Huxley model (left panel), rate constants undergo sharp transitions when a spike occurs. Approximating these transitions by pulses (right panel) led to a simplified representation of spike-generating mechanisms (threshold of −53 mV, pulse duration of 0.6 ms, rate values as in equation 2.4).
The basis of the simplified model is to approximate the time course of the rate constants by a pulse triggered when the membrane potential crosses a
given threshold. For example, α_m will be assigned a positive value during the pulse and will be zero otherwise. Generalizing for other rate constants, before and after the pulse, we have:

\alpha_m = 0, \quad \beta_m = \beta_M
\alpha_h = \alpha_H, \quad \beta_h = 0
\alpha_n = 0, \quad \beta_n = \beta_N   (2.2)

whereas during the pulse:

\alpha_m = \alpha_M, \quad \beta_m = 0
\alpha_h = 0, \quad \beta_h = \beta_H
\alpha_n = \alpha_N, \quad \beta_n = 0.   (2.3)
Here, α_M, β_M, α_H, β_H, α_N, and β_N are constant values. They are estimated from the value of the voltage-dependent expressions at hyperpolarized and depolarized potentials. Choosing −70 mV and +20 mV and using Traub and Miles's formulation gives:

\alpha_M = \alpha_m(20) \simeq 22~\mathrm{ms}^{-1}
\beta_M = \beta_m(-70) \simeq 13~\mathrm{ms}^{-1}
\alpha_H = \alpha_h(-70) \simeq 0.5~\mathrm{ms}^{-1}
\beta_H = \beta_h(20) \simeq 4~\mathrm{ms}^{-1}
\alpha_N = \alpha_n(20) \simeq 2.2~\mathrm{ms}^{-1}
\beta_N = \beta_n(-70) \simeq 0.76~\mathrm{ms}^{-1}.   (2.4)
It is then straightforward to solve equation 2.1, leading to the following expressions. Before and after the pulse, the state variables are given by:

m(t) = m_0 \exp[-\beta_M (t - t_0)]
h(t) = 1 + (h_0 - 1) \exp[-\alpha_H (t - t_0)]
n(t) = n_0 \exp[-\beta_N (t - t_0)]   (2.5)

where t_0 is the time at which the last pulse ended, and m_0, h_0, and n_0 are the values of m, h, and n at that time. During the pulse, the state variables are given by:

m(t) = 1 + (m_0 - 1) \exp[-\alpha_M (t - t_0)]
h(t) = h_0 \exp[-\beta_H (t - t_0)]
n(t) = 1 + (n_0 - 1) \exp[-\alpha_N (t - t_0)].   (2.6)
Here, t_0 is the time at which the pulse began, and m_0, h_0, and n_0 are the values of m, h, and n at that time. Using this procedure, the simplified model generated action potentials similar to the HH model (see Fig. 1, right panel). The time course of the activation variables m, h, and n was also comparable in both models. The optimal values of parameters were estimated by fitting expressions 2.5 and 2.6 directly to the original HH model. Using a simplex procedure (Press et al., 1986), and starting from different initial values of the parameters, the fitting led to a unique estimate of the pulse duration (0.6 ± 0.08 ms) and of the membrane threshold (−50.1 ± 0.3 mV; other parameters were as in Fig. 1).

The computational efficiency of the pulse-based mechanism is essentially due to the direct estimation of the variables m, h, and n from the membrane potential (see equations 2.5 and 2.6). Another factor providing further acceleration is that outside the pulse, an expression similar to equation 2.5 can be written for m³ and n⁴, which greatly reduces the number of multiplications since no exponentiation is calculated.

3 Comparison of Pulse-Based Models with Other Mechanisms

The important point of pulse-based (PB) models is that they provide a good approximation of the dynamical properties of firing behavior described by the HH mechanism. These properties were tested in three ways: repetitive firing, random firing, and network behavior.

First, the optimized PB model was compared to HH and IAF mechanisms following injection of current pulses of increasing amplitudes (see Fig. 2). Pulse-based models gave an excellent approximation of the interspike intervals at different firing frequencies, the spike shape, and the rise and decay of the membrane potential. On the other hand, the IAF model taken in the same conditions gave a poor approximation (see Fig. 2). For IAF models, the threshold or the absolute refractory period affected the frequency of firing, but variations of these parameters failed to generate the correct firing frequencies.

Second, PB models were tested in the case of random synaptic bombardment in a single-compartment cell having AMPA and GABA_A postsynaptic receptors (see Fig. 3). The conductance of synaptic currents was chosen such that most postsynaptic potentials were subthreshold, occasionally leading to firing. This subthreshold dynamics is surprisingly irrelevant for spiking, as PB and IAF mechanisms behaved very much like the HH model (see Fig. 3), with spike timing differences of 1.1 ± 1.8 ms (average ± standard deviation) and rare occurrences of nonmatching spikes. The precision in spike timing was very sensitive to the value of the threshold and slightly better for PB models.

Third, PB models were tested in network simulations having both intrinsic and synaptic currents, besides I_Na and I_Kd. The network considered was a model of spindle oscillations in interconnected thalamocortical (TC) and
Figure 2: Comparison of repetitive firing with three different models. A single compartment cell (same parameters as in equation 2.1) was simulated with injection of three depolarizing current pulses of increasing amplitudes (1 nA, 2 nA, and 4 nA). Top trace: Simulation using the original Hodgkin-Huxley model. Middle trace: Simulation using pulse-based equations (identical parameters as in Figure 1, right). Bottom trace: Simple integrate-and-fire model (the spike consisted of a sudden rise to 50 mV, immediately followed by a reset to −90 mV and an absolute refractory period of 1.5 ms; same threshold as the pulse-based mechanism). All models had identical passive properties.
thalamic reticular (RE) neurons (Destexhe, Bal, McCormick, & Sejnowski, 1996). These neurons have a set of intrinsic sodium, potassium, calcium, and cationic currents, allowing the cell to produce different types of bursts of action potentials (see Steriade, McCormick, & Sejnowski, 1993). Although single TC and RE cells do not necessarily generate sustained oscillations, interconnected TC and RE cells with excitatory (AMPA-mediated) and inhibitory (GABAergic) synapses can display spindle oscillations, as seen from
Figure 3: Comparison of pulse-based and original Hodgkin-Huxley models using random synaptic bombardment. Top panel: Single compartment model (same parameters as in equation 2.1) with 100 excitatory (AMPA-mediated) and 10 inhibitory (GABA_A-mediated) synaptic inputs. Synaptic currents were described by kinetic models (Destexhe et al., 1997) and had maximal conductances of 1.5 and 6 nS for each simulated AMPA- and GABA_A-mediated contact, respectively. Each synaptic input was random (Poisson distributed) with a mean rate of 50 Hz. Middle panel: Same simulation with action potentials generated using pulse-based models. Bottom panel: Same simulation using integrate-and-fire mechanisms.
in vivo and in vitro recordings (Steriade et al., 1993). This behavior was successfully reproduced by the model using either HH equations or PB models (see Fig. 4). As in the case of synaptic bombardment, action potential timing
Figure 4: Comparison of simplified and original Hodgkin-Huxley models using simulations of circuits of neurons. Left panel: Model of spindle oscillations in a network of 16 thalamic neurons. Eight thalamocortical (TC) cells projected to 8 thalamic reticular (RE) neurons using AMPA-mediated contacts, whereas RE cells contacted themselves using GABA_A receptors and TC cells using a mixture of GABA_A and GABA_B receptors. Each neuron had Na+, K+, Ca2+, and cationic voltage-dependent currents described by Hodgkin-Huxley–type equations, whose parameters were obtained from voltage-clamp and current-clamp data (Destexhe et al., 1996). One cell of each type is represented during a spindle sequence. Middle panel: Same simulation using pulse-based mechanisms for action potentials. Right panel: Same simulation using integrate-and-fire models.
was not rigorously identical in the two models, but spindle oscillations had the correct frequency and phase relationships between cells. On the other hand, the high-frequency firing of TC and RE cells was not well captured by IAF mechanisms, which gave rise to different oscillation frequencies at the network level (see the right panel of Fig. 4). Finally, the computational performance was compared for different mechanisms of spike generation in a single compartment cell. Table 1 shows that the PB method is significantly faster than differential equations (DE). According to the algorithms available with the NEURON simulator (Hines, 1993), the integration of HH equations can be optimized (DEo) by using precalculated rate constants, tabulated as a function of V. Although these optimized HH equations led to substantial acceleration (DEo vs. DE), they were not as efficient as PB models, which estimate m, h, and n directly from V. Simple IAF models showed the highest performance.
Table 1: Computational Performance of Different Methods for Generating Action Potentials.

Method   NEURON (relative CPU time)   Minimal C code (relative CPU time)
DE             1                            1
DEo            0.32                         0.45
PB             0.25                         0.17
IAF            0.23a                        0.08

Note: A single-compartment cell comprising passive currents, 1 nA injected current, and a model for action potentials was simulated. The different models for action potentials were: DE, Hodgkin-Huxley model described by differential equations; DEo, the same equations solved using optimized algorithms; PB, pulse-based model; IAF, simple integrate-and-fire model. A backward Euler integration scheme was used with a time step of 0.025 ms over a total simulated time of 50 seconds. The equations were solved using either the NEURON simulator (Hines, 1993) or directly in C code (minimal code for solving the equations without graphic interface; the exact same algorithms and optimizations as in NEURON were used). Codes are available on request.
a. IAF models could not be implemented optimally in NEURON.
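For concreteness, a minimal sketch of the PB update follows. It uses the rate values of equation 2.4 and the threshold (−53 mV) and pulse duration (0.6 ms) of Figure 1; the test waveform driving the threshold logic is an arbitrary stand-in for a membrane equation, so this is an illustration of the gate update, not a complete neuron model.

```python
import numpy as np

# Pulse-based gate update (equations 2.2-2.6): m, h, n are obtained in
# closed form, so no differential equation is solved for the gates.
aM, bM, aH, bH, aN, bN = 22.0, 13.0, 0.5, 4.0, 2.2, 0.76  # ms^-1 (eq. 2.4)
THRESH, PULSE = -53.0, 0.6                                # mV, ms (Fig. 1)

def pb_gates(t, t0, m0, h0, n0, in_pulse):
    """m, h, n at time t; t0 is when the pulse began (or ended) and
    m0, h0, n0 are the gate values at that instant."""
    dt = t - t0
    if in_pulse:                                   # equation 2.6
        return (1.0 + (m0 - 1.0) * np.exp(-aM * dt),
                h0 * np.exp(-bH * dt),
                1.0 + (n0 - 1.0) * np.exp(-aN * dt))
    else:                                          # equation 2.5
        return (m0 * np.exp(-bM * dt),
                1.0 + (h0 - 1.0) * np.exp(-aH * dt),
                n0 * np.exp(-bN * dt))

# Driver with an assumed test waveform for V(t); a full model would impose
# the dead time mentioned in Section 4 and couple the conductances back
# into the membrane equation.
t0, gates, in_pulse, dt = 0.0, (0.0, 1.0, 0.0), False, 0.01
for k in range(5000):
    t = k * dt
    V = -70.0 + 30.0 * ((t / 5.0) % 1.0)           # illustrative V(t), mV
    if (not in_pulse and V >= THRESH) or (in_pulse and t - t0 >= PULSE):
        gates, t0, in_pulse = pb_gates(t, t0, *gates, in_pulse), t, not in_pulse
    m, h, n = pb_gates(t, t0, *gates, in_pulse)
    gNa, gKd = 100.0 * m**3 * h, 80.0 * n**4       # conductances, mS/cm^2
```

The only state carried between steps is (t0, m0, h0, n0) and the pulse flag, which is the source of the speedup reported in Table 1.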
4 Discussion

This article has presented a novel method to approximate the dynamics of action potentials. The model was based on the HH equations in which the time course of rate constants was approximated by a short pulse. In a previous article (Destexhe, Mainen, & Sejnowski, 1994a), similar PB mechanisms were introduced to model synaptic currents. The basis of the synaptic model was that artificially applied pulses of agonist gave rise to postsynaptic currents with a time course very similar to the intact synapse (e.g., Colquhoun, Jonas, & Sakmann, 1992). PB models of synaptic currents offered the advantages of being simple and fast, and they could be easily fit to experimental data. In the case reported in this article, the pulse approximation of HH equations was based on the observation that rate constants undergo sharp transitions during action potentials (see Fig. 1), also leading to analytic resolution and therefore fast algorithms that can be fit easily to any template. In principle, the same approximation could be considered for other types of channels, under the important condition that rate constants display sharp transitions. A possible example would be Ca2+-dependent channels, assuming that an action potential triggers a short pulse of intracellular Ca2+ concentration due to high-threshold Ca2+ currents.

Are PB models different from conventional IAF mechanisms? Both methods share a fixed voltage threshold as well as a dead time following spikes to represent the absolute refractory period. PB models, however, significantly
differ from IAF models because they take into account the decay of the variables m, h, and n in between spikes, which has a determinant influence on repetitive firing at high frequencies. In IAF models, the membrane is reset instantaneously after a spike to a hyperpolarized level. In PB models, the reset occurs through the activation of the delayed-rectifier current, which has an effect over a longer period of time due to the relatively slow decay of the variable n. This is the main factor that affected the frequency of firing in Figure 2. However, we cannot exclude the possibility that more sophisticated IAF models (with a relative refractory period or a variable threshold) would reproduce these properties, although this remains to be demonstrated.

Are PB models different from precalculated conductance waveforms? The model presented here is more sophisticated than simply pasting a conductance waveform, as the changes in the variables m, h, and n depend on their initial values at the time of the spike. Conductance changes are different during low- and high-frequency firing, and PB models propose simplified equations to capture these changes. This is similar to comparing alpha functions with PB models of synaptic currents: alpha functions do not provide correct postsynaptic summation, whereas it is well captured by PB models (see Destexhe, Mainen, & Sejnowski, 1994b).

When do PB models fail? It was shown here that PB models approximate HH equations well for high-frequency firing during injection of current pulses (see Fig. 2), following random synaptic inputs (see Fig. 3), or in network simulations where intrinsic and synaptic currents were present (see Fig. 4). However, as the spike is defined from a fixed threshold, PB models would not apply to cases with a significant subthreshold contribution of the HH mechanism in determining spike timing or if there are dynamical changes in the threshold. PB models cannot be applied to model various properties of sodium channels, such as the progressive spike inactivation during bursting, anode-break excitation, voltage-dependent recovery from inactivation, and more generally the response of these channels during voltage-clamp experiments. In these cases, spike generation should be modeled with the original HH equations or a more accurate mechanism.

How do PB models compare to other types of simplifications of HH equations? Several models were proposed to reduce HH equations to fewer variables (e.g., FitzHugh, 1961; Krinskii & Kokoz, 1973; Hindmarsh & Rose, 1982; Rinzel, 1985; Kepler, Abbott, & Marder, 1992). Another type of simplification was to start from detailed kinetic models for the gating of Na+ or K+ channels and then simplify this scheme until reaching the minimal number of states needed to account for the behavior of the current (Destexhe et al., 1994b). Pulse-based models are different from these previous approaches; their simplifying assumptions lead to a complete analytical resolution of HH equations, and therefore no differential equation must be solved (see also Miller, 1979).

How large is the increase of computational efficiency? PB models provide a substantial acceleration compared to HH equations, approaching the per-
formance of IAF models (see Table 1). However, this increase of efficiency could be minimal in the case of network simulations where action potentials represent only a small fraction of the computation time. On the other hand, if PB models could be used in conjunction with pulse-based mechanisms applied to other types of currents, such as synaptic currents (Destexhe et al., 1994a; Lytton, 1996), then this approach could provide a considerable acceleration of computation time.

How can PB models be improved? An obvious improvement would be to apply the PB approximation to a simplified model of action potentials, such as that of Hindmarsh and Rose (1982), leading to PB models with fewer variables. Another improvement would be to add a relative refractory period or a variable threshold, for example, by making the threshold depend on the value of the inactivation variable h. More optimal values of rate constants than those in equation 2.4 could also be obtained by fitting the model directly to HH equations using several templates, such as those of Figures 2 and 3.

In conclusion, PB models capture important features of the firing behavior as described by HH mechanisms, but they do not apply to all cases. This approach is intermediate between simple IAF and more detailed biophysical models and may facilitate building large-scale network simulations that incorporate the intrinsic electrophysiological properties of single cells.

Acknowledgments

All simulations were run on a Sun Sparc 20 workstation using the NEURON simulator (Hines, 1993) or using programs written in C. Research was supported by the MRC (Medical Research Council of Canada), and start-up funds came from the FRSQ (Fonds de la Recherche en Santé du Québec).

References

Arbib, M. (1995). The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Colquhoun, D., Jonas, P., & Sakmann, B. (1992). Action of brief pulses of glutamate on AMPA/kainate receptors in patches from different neurons of rat hippocampal slices. J. Physiol., 458, 261–287.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1997). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed.). Cambridge, MA: MIT Press.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J. Neurophysiol., 76, 2049–2070.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.
Destexhe, A., Mainen, Z., & Sejnowski, T. J. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Computational Neurosci., 1, 195–230.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophys. J., 1, 445–466.
Hille, B. (1992). Ionic channels of excitable membranes. Sunderland, MA: Sinauer Associates.
Hindmarsh, J. L., & Rose, R. M. (1982). A model of the nerve impulse using two first-order differential equations. Nature, 296, 162–164.
Hines, M. (1993). NEURON—A program for simulation of nerve equations. In F. H. Eeckman (Ed.), Neural systems: Analysis and modeling (pp. 127–136). Boston: Kluwer Academic Publishers.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Kepler, T. B., Abbott, L. F., & Marder, E. (1992). Reduction of conductance-based neuron models. Biol. Cybernetics, 66, 381–387.
Krinskii, V. I., & Kokoz, Y. M. (1973). Analysis of the equations of excitable membranes—1. Reduction of the Hodgkin-Huxley equations to a second-order system. Biofizika, 18, 506–511.
Llinás, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: A new insight into CNS function. Science, 242, 1654–1664.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Computation, 8, 501–509.
Mainen, Z. F. (1995). Mechanisms of spike generation in neocortical neurons. Unpublished doctoral dissertation, University of California, San Diego.
Miller, R. M. (1979). A simple model of delay, block and one way conduction in Purkinje fibers. J. Math. Biol., 7, 385–398.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Rinzel, J. (1985). Excitation dynamics: Insights from simplified membrane models. Fed. Proc., 44, 2944–2946.
Steriade, M., McCormick, D. A., & Sejnowski, T. J. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Vandenberg, C. A., & Bezanilla, F. (1991). A model of sodium channel gating based on single channel, macroscopic ionic, and gating currents in the squid giant axon. Biophys. J., 60, 1511–1533.
Received January 25, 1995; accepted June 21, 1996.
NOTE
Communicated by William Lytton
A Simple Model of Transmitter Release and Facilitation

Richard Bertram
Pennsylvania State University, Division of Science, Erie, PA 16563 USA
We describe a model of synaptic transmitter release and presynaptic facilitation that is based on activation of release sites by single Ca2+ microdomains. Facilitation is due to Ca2+ that remains bound to release sites between impulses. This model is inherently stochastic, but deterministic equations can be derived for the mean release. The number of equations required to describe the mean release is small, so it is practical to use the model with models of neuronal electrical activity to investigate the effects of different input spike patterns on presynaptic facilitation. We use it in conjunction with a model of dopamine-secreting neurons of the basal ganglia to demonstrate that transmitter release is greater when the neuron bursts than when it spikes continuously, due to the greater facilitation generated by the bursting impulse pattern. Finally, a minimal form of the model is described that is coupled to simple models of postsynaptic receptors and passive membrane to compute the postsynaptic voltage response to a train of presynaptic stimuli. This form of the model is appropriate for neural network simulations.
1 Introduction

Several models have been proposed for synaptic transmitter release and presynaptic facilitation, the process whereby impulse-evoked release is enhanced for up to several hundred milliseconds if preceded by one or more conditioning impulses (Magleby, 1987). In the models of Zucker and Fogelson (1986) and Yamada and Zucker (1992), release sites and Ca2+ channels are spatially separated, and facilitation is due primarily to residual free Ca2+ left over from previous impulse-induced Ca2+ channel openings, so the diffusion of free Ca2+ is important. In an alternative model (Stanley, 1986; Bertram, Sherman, & Stanley, 1996), each transmitter release site is colocalized with a single Ca2+ channel. Release is then activated by the microdomain of Ca2+ surrounding the channel mouth, eliminating the need to perform Ca2+ diffusion calculations. This model is intended to describe release evoked by short depolarizations, such as action potentials, during which the opening of multiple Ca2+ channels in the neighborhood of a release site is unlikely.
In the single-domain model, each release site has four independent Ca2+ binding sites or gates, each of which must be bound for release to occur. Kinetics of the gates are graded from slow unbinding, high affinity to fast unbinding, low affinity. Facilitation is due to the accumulation of Ca2+ bound to one or more of the gates rather than to residual free Ca2+. This model was motivated by the findings that release sites can be activated by the Ca2+ microdomain under a single channel (Stanley, 1993) and that there is a steplike frequency dependence of facilitation associated with a steplike decline of Ca2+ cooperativity (Stanley, 1986). There is also considerable evidence that residual free Ca2+ does not play an important role in facilitation (Stanley, 1986; Blundon, Wright, Brodwick, & Bittner, 1993; Winslow, Duffy, & Charlton, 1994), although it does appear to play a role in other forms of enhancement that last for several seconds (augmentation) or several minutes (posttetanic potentiation) following a high-frequency conditioning tetanus (Delaney, Zucker, & Tank, 1989; Delaney & Tank, 1994).

2 The Mathematical Model

The binding of Ca2+ to the jth gate obeys a first-order kinetic scheme:

U_j + \mathrm{Ca} \;\overset{k_j^+}{\underset{k_j^-}{\rightleftharpoons}}\; B_j, \quad j = 1, 2, 3, 4,   (2.1)
where B_j and U_j are the normalized (B_j + U_j = 1) concentrations or probabilities of bound and unbound gates and Ca is the concentration of free Ca2+ at the release site. The binding rates (k_j^+, in msec⁻¹ µM⁻¹) and unbinding rates (k_j^-, in msec⁻¹) were chosen to produce steps in the facilitation curve: k_1^+ = 3.75 × 10⁻³, k_1^- = 4 × 10⁻⁴; k_2^+ = 2.5 × 10⁻³, k_2^- = 1 × 10⁻³; k_3^+ = 5 × 10⁻⁴, k_3^- = 0.1; k_4^+ = 7.5 × 10⁻³, k_4^- = 10. The k_j^- vary from small to large, resulting in a wide range of unbinding time constants (1/k_j^-): 2.5 sec, 1 sec, 10 msec, and 0.1 msec, respectively. The temporal evolution of B_j is described by

\frac{dB_j}{dt} = -(k_j^- + k_j^+ \mathrm{Ca})\, B_j + k_j^+ \mathrm{Ca}, \quad j = 1, 2, 3, 4.   (2.2)
The opening and closing of a Ca2+ channel is a stochastic process, with transition rates that depend on membrane voltage (V). The quantity Ca is therefore a random variable. In Bertram et al. (1996), a Monte Carlo procedure was used to handle the effects of stochastic channel activity on release. Although effective, this approach is computationally demanding, and a better approach was subsequently developed (Bertram & Sherman, in press). Here, Fokker-Planck-type partial differential equations were derived for the probability distributions of bound gates under open and closed
Ca2+ channels. Ordinary differential equations for the expected values of these distributions—that is, the mean concentrations of bound gates of each type—were then derived. Although the four gates are physically independent, they are statistically correlated through their common dependence on the domain Ca2+ concentration. Therefore, expected or mean release is not simply the product of the mean bound concentrations of each gate type; it depends on the mean concentrations of all gate configurations, resulting in a system of 30 differential equations. However, it was shown that the correlation among gates is removed and the number of equations necessary to compute release reduced to 4 if it is assumed that the gates respond to the average domain Ca2+ concentration—the domain Ca2+ concentration averaged over the entire population of Ca2+ channels. Using perturbation analysis, it was shown that the error introduced by this approximation is typically small. Taking the expected values of both sides of equation 2.2, and replacing E[Ca B_j] by E[Ca] E[B_j] (the average domain Ca2+ approximation), we get

\frac{d\sigma_j}{dt} = -(k_j^- + k_j^+ \overline{\mathrm{Ca}})\, \sigma_j + k_j^+ \overline{\mathrm{Ca}}, \quad j = 1, 2, 3, 4,   (2.3)
where σ_j is the expected value of B_j and \overline{\mathrm{Ca}} = E[Ca] is the average domain Ca2+ concentration, equal to the product of the fraction of open channels (m) and the domain Ca2+ concentration at an open channel (Ca_open). The latter is assumed to be proportional to the Ca2+ influx through an open channel (i(V)): Ca_open = −A i(V). Assuming that the Ca2+ channel has a single open and a single closed state, the fraction of open channels satisfies

\frac{dm}{dt} = \alpha_m (1 - m) - \beta_m m   (2.4)
where αm , βm are the channel opening and closing rates, respectively. Expected release is then E[R] = σ1 σ2 σ3 σ4 .
(2.5)
The fourth gate, unlike the others, has fast unbinding kinetics, so that σ4 changes rapidly with changes in Ca. Thus, the number of differential equations needed to compute E[R] is further reduced by one by assuming that σ4 is in equilibrium with Ca.
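As a sketch of how equations 2.3 through 2.5 are used in practice, the fragment below advances the three slow gates with forward Euler for a given average domain Ca2+ concentration and sets σ4 to its equilibrium value, as described above. The driving concentration Ca_bar is left as an input; in the article it is the product of m and Ca_open.

```python
import numpy as np

# Mean-release computation (equations 2.3-2.5) with sigma_4 at equilibrium.
# Rate constants are those of Section 2 (msec^-1 uM^-1 and msec^-1).
kp = np.array([3.75e-3, 2.5e-3, 5e-4, 7.5e-3])   # k_j^+
km = np.array([4e-4, 1e-3, 0.1, 10.0])           # k_j^-

def step_sigma(sigma, Ca_bar, dt):
    """Forward-Euler step of equation 2.3 for the three slow gates."""
    return sigma + dt * (-(km[:3] + kp[:3] * Ca_bar) * sigma + kp[:3] * Ca_bar)

def expected_release(sigma, Ca_bar):
    """E[R] = sigma_1 sigma_2 sigma_3 sigma_4 (equation 2.5); the fast
    fourth gate is assumed equilibrated with Ca_bar."""
    sigma4 = kp[3] * Ca_bar / (km[3] + kp[3] * Ca_bar)
    return float(np.prod(sigma)) * sigma4
```

Coupling this to a voltage model only requires supplying Ca_bar(t) = m(t) · Ca_open(V(t)) from equation 2.4 and the single-channel current.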
3 Example: The Dopamine Neuron

Our release model can be used with any model of membrane voltage or even with experimental voltage recordings. As an example, we use it in
conjunction with a model of in vitro electrical activity of the dopamine neuron of the basal ganglia (Li, Bertram, & Rinzel, 1996). A periodic bursting pattern, high-frequency spiking followed by a long period of silence, is produced (see Fig. 1B) in the presence of the glutamate agonist N-methyl-D-aspartate (NMDA). In the absence of NMDA, the cell spikes continuously at a lower frequency (see Fig. 1A). Mean release evoked by the continuous spiking pattern shows some facilitation (see Fig. 1E), due to the slow growth of σ1 and σ2 (see Fig. 1C). However, facilitation generated by the bursting pattern is much greater (see Fig. 1F), consistent with in vivo experimental findings (Gonon, 1988). In this case σ3 increases during the first portion of the active phase, while σ1 and σ2 increase throughout the active phase (see Fig. 1D), facilitating release within an active phase. Between active phases σ3 decays back to its baseline value, but the burst period is short enough (800 msec) that the active-phase increases in σ1 and σ2 are not entirely lost. Therefore, σ1 and σ2 accumulate over several bursts, and release evoked by later bursts is greater than that evoked by earlier ones.

4 A Minimal Model

To demonstrate how the transmitter release model can be used in neural network simulations, it is simplified to a form that can be used with integrate-and-fire neurons. This model is coupled to simple models of postsynaptic transmitter receptors and passive membrane to generate a postsynaptic voltage response. Following Destexhe, Mainen, & Sejnowski (1994a), we assume that each impulse leads to the release of a square pulse of transmitter of 1 msec duration. In contrast to Destexhe et al., we assume that the size of the transmitter pulse evoked by each stimulus is not constant. The magnitude of the initial pulse is T(1), while the magnitude of each subsequent pulse is the product of T(1) and the presynaptic facilitation. Facilitation is computed by solving equation 2.3 for σ1, σ2, and σ3 (or just one of the three for the simplest model that exhibits facilitation), assuming that each presynaptic stimulus leads to a square pulse of Ca lasting 1 msec. (With 10 mM external Ca2+, Ca summed over an impulse is 63 µM, using the Hodgkin-Huxley equations [Hodgkin & Huxley, 1952] to generate the impulse. This value may be used as the magnitude of the Ca pulse, scaled linearly to reflect any changes in the external Ca2+ concentration.) Facilitation contributed by the jth gate is then the ratio of σj at the end of the nth pulse to the value at the end of the first pulse.

We assume that the facilitating transmitter pulses act on a simple two-state postsynaptic receptor (Destexhe et al., 1994a), with the fraction of bound receptors (a) given by

\frac{da}{dt} = \alpha_a T (1 - a) - \beta_a a,   (4.1)
Figure 1: Expected release (see equations 2.3–2.5) evoked by a dopamine neuron model. In the presence of NMDA, the cell generates bursts of action potentials (B); otherwise, it spikes continuously (A). Release is much greater when the cell bursts (E, F; notice the different scales) due to the greater accumulation of bound gates (C, D; σ3 has been multiplied by 30). The single-channel Ca2+ current is given by a Goldman-Hodgkin-Katz expression, i(V) = 1.45V[Ca_ex/(1 − exp(0.07V))], where Ca_ex = 1 mM is the external Ca2+ concentration. We use A = 0.1 µM fA⁻¹ and α_m = 3.1 e^{V/10}, β_m = e^{−V/26.7} to compute \overline{\mathrm{Ca}}. Both α_m and β_m are based on squid giant synapse data (Llinás, Steinberg, & Walton, 1981), modified for the higher temperature at which dopamine neurons typically operate (37°C).
where T is the transmitter concentration and α_a and β_a are the binding and unbinding rates, respectively. Postsynaptic membrane voltage (V_post) is computed with the passive membrane equation

\frac{dV_{post}}{dt} = -(I_{mem} + I_{syn}) / C_{mem}   (4.2)
where C_mem is the membrane capacitance, I_mem = \bar{g}_{mem} (V_post − V_mem) is the passive membrane current, and I_syn = \bar{g}_{syn} a (V_post − V_syn) is the synaptic current. Note that since both Ca and T appear as square pulses in equations 2.3 and 4.1, the equations have piecewise exponential solutions (see Destexhe et al., 1994a).

In Figure 2 this minimal model is used to compute the postsynaptic voltage response to a short 100-Hz stimulus train. The synapse facilitates throughout the train, producing consistently larger transmitter pulses (see Fig. 2A) and postsynaptic voltage depolarizations (see Fig. 2B). Combined with the relatively slow membrane dynamics, this leads to a substantial rise in the average postsynaptic voltage (see Fig. 2B, solid curve). In contrast, when facilitation is excluded by assuming that the size of each transmitter pulse is the same, the average postsynaptic voltage exhibits only a small rise, associated with the slow membrane dynamics (see Fig. 2B, dashed curve).
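Because Ca arrives as fixed 1-msec square pulses, the facilitation of each transmitter pulse can be computed event by event from the closed-form (piecewise exponential) solution of equation 2.3, without numerical integration. The sketch below does this for a ten-impulse 100-Hz train; the 63 µM pulse amplitude and 0.1 mM initial transmitter pulse follow the text and Figure 2, while the printout is purely illustrative.

```python
import numpy as np

# Event-driven facilitation in the minimal model (Section 4): each stimulus
# delivers a 1-msec square Ca pulse of 63 uM; sigma_1..sigma_3 follow the
# piecewise exponential solution of equation 2.3, and the nth transmitter
# pulse is T(n) = T(1) * facilitation(n).
kp = np.array([3.75e-3, 2.5e-3, 5e-4])        # k_j^+ (msec^-1 uM^-1)
km = np.array([4e-4, 1e-3, 0.1])              # k_j^- (msec^-1)
CA, WIDTH, PERIOD, T1 = 63.0, 1.0, 10.0, 0.1  # uM, ms, ms (100 Hz), mM

rate = km + kp * CA                 # relaxation rate during a pulse
sig_inf = kp * CA / rate            # gate equilibrium during a pulse
sig, ref = np.zeros(3), None
for n in range(1, 11):
    sig = sig_inf + (sig - sig_inf) * np.exp(-rate * WIDTH)  # end of pulse n
    ref = sig.copy() if ref is None else ref   # values after the first pulse
    fac = float(np.prod(sig / ref))            # product of per-gate ratios
    print(f"pulse {n:2d}: facilitation {fac:5.2f}, T = {T1 * fac:.3f} mM")
    sig *= np.exp(-km * (PERIOD - WIDTH))      # free decay to next stimulus
```

The slow-unbinding gates (small k_j^-) barely decay between stimuli, so the facilitation grows over the train, which is what produces the rising transmitter pulses of Figure 2A.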
5 Discussion

Computational models of synaptic transmission typically concentrate on postsynaptic mechanisms, assuming explicitly or implicitly that the amount of neurotransmitter released by each impulse is the same (Rall, 1967; Destexhe et al., 1994a). However, a key feature of the transmission process is that the number of transmitter molecules released by an impulse depends on the history of presynaptic electrical activity. The model discussed here is one of several intended to describe presynaptic transmitter release and short-term presynaptic enhancement (Zucker & Fogelson, 1986; Parnas, Dudel, & Parnas, 1986; Yamada & Zucker, 1992). In addition to the differences discussed earlier, these other release models differ from the present one in the structure of their release sites. In Zucker and Fogelson (1986), it was assumed that the kinetics of Ca2+ binding to a release site are instantaneous, so that the rate of release is proportional to some power of the Ca2+ concentration at the release site. This model was later modified to include finite kinetic rates (Yamada & Zucker, 1992), but because the modified model includes only one slow Ca2+ binding site, facilitation depends largely on residual free Ca2+, contrary to a growing body of experimental data (Stanley, 1986; Blundon et al., 1993; Winslow et al., 1994). In addition, this model does not account for the multiple time scales of facilitation observed experimentally (Stanley, 1986; Magleby, 1987). Finally, the model by Parnas et al. (1986)
Figure 2: The minimal model of presynaptic facilitation is combined with simple models of postsynaptic binding and passive membrane to describe the postsynaptic voltage response to a short 100-Hz impulse train (see equations 2.3, 4.1, and 4.2). (A) Each stimulus elicits a 1-msec square pulse of transmitter (T). The magnitude of the first pulse is assumed to be 0.1 mM. The magnitude of subsequent pulses grows as the synapse facilitates. (B) The increased magnitude of T results in an increased postsynaptic voltage response with each stimulus, leading to a significant rise in the average postsynaptic voltage (solid line). In the absence of facilitation, the average voltage equilibrates at a lower level (dashed line). Postsynaptic parameter values are: α_a = 2 msec⁻¹ mM⁻¹, β_a = 1 msec⁻¹, \bar{g}_{mem} = 0.1, \bar{g}_{syn} = 0.2 (mS cm⁻²), V_mem = −70, V_syn = 0 (mV), and C_mem = 1 µF cm⁻².
postulates an explicit voltage dependence in the release mechanism, a feature that has not been supported by experimental data (Landò & Zucker, 1994).
The present model also belongs to the family of kinetic models used to describe a wide variety of neuronal mechanisms (see Destexhe, Mainen, & Sejnowski, 1994b, for an overview). These include voltage-gated and ligand-gated ion channels (Hodgkin & Huxley, 1952; Destexhe et al., 1994b); IP3-sensitive Ca2+ channels in the membrane of the endoplasmic reticulum (De Young & Keizer, 1992); postsynaptic receptors and channels (Standley, Ramsey, & Usherwood, 1993; Destexhe et al., 1994a,b); and presynaptic transmitter release sites (Parnas et al., 1986; Yamada & Zucker, 1992; Destexhe et al., 1994b; Bertram et al., 1996). In Destexhe et al. (1994b) a nonfacilitating kinetic model of transmitter release was coupled to kinetic models of postsynaptic receptor binding to provide a description of the complete synaptic signal transduction process. As demonstrated in Figure 2, the release model discussed in this article can be applied in a similar manner, introducing the key feature of presynaptic facilitation to the signal transduction scheme.
Acknowledgments

I thank Arthur Sherman for many helpful discussions and for a careful reading of the manuscript. This work was performed while I was at the Mathematical Research Branch (NIDDK, NIH) in Bethesda, MD.
References

Bertram, R., and Sherman, A. (In press). Population dynamics of synaptic release sites. SIAM J. Appl. Math.
Bertram, R., Sherman, A., and Stanley, E. F. (1996). Single-domain/bound calcium hypothesis of transmitter release and facilitation. J. Neurophysiol., 75, 1919–1931.
Blundon, J. A., Wright, S. N., Brodwick, M. S., and Bittner, G. D. (1993). Residual free calcium is not responsible for facilitation of neurotransmitter release. Proc. Natl. Acad. Sci. USA, 90, 9388–9392.
De Young, G., and Keizer, J. (1992). A single pool IP3-receptor based model for agonist stimulated Ca2+ oscillations. Proc. Natl. Acad. Sci. USA, 89, 9895–9899.
Delaney, K. R., and Tank, D. W. (1994). A quantitative measurement of the dependence of short-term synaptic enhancement on presynaptic residual calcium. J. Neurosci., 14, 5885–5902.
Delaney, K. R., Zucker, R. S., and Tank, D. W. (1989). Calcium in motor nerve terminals associated with posttetanic potentiation. J. Neurosci., 9, 3558–3567.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.
Destexhe, A., Mainen, Z. F., and Sejnowski, T. J. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comput. Neurosci., 1, 195–230.
Gonon, F. G. (1988). Nonlinear relationship between impulse flow and dopamine release by rat midbrain dopaminergic neurons as studied by in vivo electrochemistry. Neuroscience, 24, 19–28.
Hodgkin, A. L., and Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Landò, L., and Zucker, R. S. (1994). Ca2+ cooperativity in neurosecretion measured using photolabile Ca2+ chelators. J. Neurophysiol., 72, 825–830.
Li, Y.-X., Bertram, R., and Rinzel, J. (1996). Modeling N-methyl-D-aspartate-induced bursting in dopamine neurons. Neuroscience, 71, 397–410.
Llinás, R., Steinberg, I. Z., and Walton, K. (1981). Presynaptic calcium currents in squid giant synapse. Biophys. J., 33, 289–322.
Magleby, K. (1987). Short-term changes in synaptic efficacy. In G. M. Edelman, W. E. Gall, and W. M. Cowan (Eds.), Synaptic function (pp. 21–56). New York: Wiley.
Parnas, H., Dudel, J., and Parnas, I. (1986). Neurotransmitter release and its facilitation in the crayfish. VII. Another voltage dependent process besides Ca entry controls the time course of phasic release. Pflügers Arch., 406, 121–130.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different somadendritic distributions of synaptic inputs. J. Neurophysiol., 30, 1138–1168.
Standley, C., Ramsey, R. L., and Usherwood, P. N. R. (1993). Gating kinetics of the quisqualate-sensitive glutamate receptor of locust muscle studied using agonist concentration jumps and computer simulations. Biophys. J., 65, 1379–1386.
Stanley, E. F. (1986). Decline in calcium cooperativity as the basis of facilitation at the squid giant synapse. J. Neurosci., 6, 782–789.
Stanley, E. F. (1993). Single calcium channels and acetylcholine release at a presynaptic nerve terminal. Neuron, 11, 1007–1011.
Winslow, J. L., Duffy, S. N., and Charlton, M. P. (1994). Homosynaptic facilitation of transmitter release in crayfish is not affected by mobile calcium chelators: Implications for the residual ionized calcium hypothesis from electrophysiological and computational analyses. J. Neurophysiol., 72, 1769–1793.
Yamada, W. M., and Zucker, R. S. (1992). Time course of transmitter release calculated from simulations of a calcium diffusion model. Biophys. J., 61, 671–682.
Zucker, R. S., and Fogelson, A. L. (1986). Relationship between transmitter release and presynaptic calcium influx when calcium enters through discrete channels. Proc. Natl. Acad. Sci. USA, 83, 3032–3036.
Received April 10, 1996; accepted August 9, 1996.
NOTE
Communicated by Klaus Pawelzik
Functional Periodic Intracortical Couplings Induced by Structured Lateral Inhibition in a Linear Cortical Network

S. P. Sabatini
G. M. Bisio
PSPC Research Group, Department of Biophysical and Electronic Engineering, University of Genova, Genova, Italy
L. Raffo
Department of Electrical and Electronic Engineering, University of Cagliari, Cagliari, Italy
The spatial organization of cortical axon and dendritic fields could be an interesting structural paradigm to obtain a functional specificity without postulating highly specific feedforward connections. In this article, we investigate the functional implications of recurrent intracortical inhibition when it occurs through clustered medium-range interconnection schemes (Wörgötter & Koch, 1991; Somogyi, 1989; Kritzer, Cowey, & Somogyi, 1992). Moreover, the interaction between the inhibitory schemes and visual orientation maps is explored. Assuming linearity, we show that clustered inhibitory mechanisms can trigger a propagation process that allows the development of extra (i.e., induced) interactions among the cortical sites involved in the recurrent loops. In addition, we point out how these interactions functionally modify the response of cortical simple cells and yield highly structured Gabor-like receptive fields. This study should be considered not as a realistic biological model of the primary visual cortex but as an attempt to explain possible computational principles related to intracortical connectivity and to the underlying single-cell properties.
1 The Model

Assuming an "averaged model" of the visual cortex (Amari, 1977; Krone, Mallot, Palm, & Schüz, 1986; Chernjavsky & Moody, 1990), the cortical surface ($\mathbf{x} = (x_1, x_2)$) can be viewed as a two-dimensional homogeneous distribution of cortical elements (nodes) characterized by specific orientation selectivities $\theta(\mathbf{x})$. In this context, a node represents a neural population, and interconnections represent functionally the overlap of axonal and dendritic clouds. Under linear steady-state conditions, the distribution of cortical
excitation $e(\mathbf{x})$ can be modeled as
$$e(\mathbf{x}) = e_0(\mathbf{x}) - b \int k_{fb}(\mathbf{x};\boldsymbol{\xi})\, e(\boldsymbol{\xi})\, d\boldsymbol{\xi} = e_0(\mathbf{x}) - b \int k_{ff}(\mathbf{x};\boldsymbol{\xi};b)\, e_0(\boldsymbol{\xi})\, d\boldsymbol{\xi}, \tag{1.1}$$
where $e_0(\mathbf{x})$ is the feedforward distribution of excitation, $b$ is the strength of the inhibition ($b > 0$), and $k_{fb}(\mathbf{x};\boldsymbol{\xi})$ is the intracortical coupling function. The feedforward resolvent kernel of the integral equation, $k_{ff}(\mathbf{x};\boldsymbol{\xi};b)$ (Winter, 1990), represents an interaction among cortical sites, telling how the total afferent drive at one site affects activity at a second site. Equation 1.1 can thus be interpreted as a two-dimensional iterative filter that detects specific characteristics present in the input pattern of excitation $e_0(\mathbf{x})$; the spatial extension over which these characteristics are detected is possibly larger than that of $k_{fb}(\mathbf{x};\cdot)$. This occurs both directly, by physical local interactions, and indirectly, through the propagation property of iterative filters, which asserts that the output values of the filter can be recurrently affected by a larger neighboring region on the cortical plane. In this way, one can speak of "induced" functional couplings not directly related to the presence of the corresponding specific wirings.

In the case of a translation-invariant kernel ($k_{fb}(\mathbf{x};\boldsymbol{\xi}) = k_{fb}(\mathbf{x}-\boldsymbol{\xi})$), the exact solution of equation 1.1 can be obtained straightforwardly by Fourier transform techniques, and the resulting cortical activity patterns tend to have a spatial frequency corresponding to the peak of the function $(1 + bK_{fb}(\mathbf{f}))^{-1}$, where uppercase letters refer to Fourier transforms and $\mathbf{f} = (f_1, f_2)$ is the spatial frequency vector. In particular, when inhibition arises from clusters of cells spatially distributed at a distance $\pm d$ from the target cell (see Fig. 1a), the spectrum of the feedforward resolvent kernel, $K_{ff} = K_{fb}(1 + bK_{fb})^{-1}$, shrinks, and the energy peak shifts approximately to $\mathbf{f}^* = (\pm 1/2d,\, 0)$ (see Fig. 1b). This result can be motivated by the following expansion:
$$[1 + bK_{fb}(\mathbf{f})]^{-1} = 1 - bK_{fb}(\mathbf{f}) + b^2K_{fb}^2(\mathbf{f}) - b^3K_{fb}^3(\mathbf{f}) + \cdots = 1 - bK_{ff}(\mathbf{f}). \tag{1.2}$$
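As a quick numerical illustration of this spectral shift (our own sketch, not part of the original article; the parameter values echo those quoted later in the Figure 2 caption), one can tabulate the resolvent spectrum $K_{ff} = K_{fb}(1 + bK_{fb})^{-1}$ for a kernel made of two gaussians offset by $\pm d$ and locate its energy peak:

```python
import numpy as np

# Illustrative parameters (after the values in the Figure 2 caption):
# cluster offset d, kernel spread sigma_k, inhibition strength b.
d, sigma_k, b = 0.25, 0.0625, 0.5

f = np.linspace(0.0, 8.0, 4001)  # spatial frequency along the axis phi_0
# Spectrum of two gaussians offset by +/- d: a gaussian envelope times a
# cosine, normalized here so that K_fb(0) = 1.
K_fb = np.exp(-2.0 * (np.pi * f * sigma_k) ** 2) * np.cos(2.0 * np.pi * f * d)
K_ff = K_fb / (1.0 + b * K_fb)   # feedforward resolvent spectrum

f_peak = f[np.argmax(np.abs(K_ff))]
print(f"|K_ff| peaks at f = {f_peak:.2f}; 1/(2d) = {1.0 / (2.0 * d):.2f}")
```

Increasing $b$ sharpens the redistribution of spectral energy around $f \approx 1/2d$, mirroring the insets of Figure 1b.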
Since $K_{fb}$ is a low-pass filter, its powers in equation 1.2 lead to a decrease in bandwidth and so to an increasing width of the resolvent coupling function. Specifically, if we consider $k_{fb}(\mathbf{x}-\boldsymbol{\xi})$ as the sum of two laterally offset gaussians $g(x_1 \pm d,\, x_2;\, \sigma_k^2)$, the powers of its spectrum result, in the space domain, in a sum of shifted gaussians with alternating signs, which contribute to the coupling with decreasing strength:
$$k_{ff}(\mathbf{x}-\boldsymbol{\xi}) = \sum_{l=1}^{\infty} (-1)^{l-1}\, k_l(\mathbf{x}-\boldsymbol{\xi}), \tag{1.3}$$
Figure 1: (a) Schematic representation of the recurrent inhibitory kernel ($k_{fb}$), with $\theta = \pi/2$, $\varphi_0 = \theta + \pi/2$, $\Delta\varphi = \pi/8$. The thick vertical line in the center represents the orientation of the target cell, which receives inhibition from the cells in the shaded area. The lateral spread of interconnections can be characterized by the angular bandwidth $\alpha$ related to the half-peak magnitude extension of the kernel. (b) The normalized cross sections, along the direction $\varphi_0$, of $|K_{fb}(\mathbf{f})|$ (dashed line) and of $|K_{ff}(\mathbf{f})|$ for $b = 0.5$ (thick continuous line). Increasing the value of $b$ (see the insets), one can observe a major redistribution of the spectral energy around the frequency $f_1 = \pm 1/2d$.
where
$$k_l(x_1-\xi_1,\, x_2-\xi_2) = \frac{b^{l-1}}{l} \sum_{m=0}^{l} \binom{l}{m}\, g\!\left(x_1-\xi_1-(2m-l)d,\; x_2-\xi_2;\; l\sigma_k^2\right). \tag{1.4}$$
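A direct numerical evaluation of the truncated series 1.3 and 1.4 (again our own sketch, with the same illustrative parameters as above) makes the alternation of excitatory and inhibitory lobes visible along a one-dimensional section through the coupling field:

```python
import numpy as np
from math import comb

def g(x, mean, var):
    # Unnormalized gaussian bump standing in for the coupling term g(.)
    return np.exp(-(x - mean) ** 2 / (2.0 * var))

d, sigma_k, b = 0.25, 0.0625, 0.5        # illustrative parameters, as above
x = np.linspace(-1.0, 1.0, 801)          # 1-D section along the offset axis

k_ff = np.zeros_like(x)
for l in range(1, 40):                   # truncation of the series 1.3
    k_l = (b ** (l - 1) / l) * sum(
        comb(l, m) * g(x, (2 * m - l) * d, l * sigma_k ** 2)
        for m in range(l + 1))
    k_ff += (-1) ** (l - 1) * k_l

# Sample the coupling's sign every quarter-offset (step d): the interleaving
# of +1 and -1 reflects the alternating-sign structure of the series.
print(np.sign(k_ff[::100]).astype(int))
```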
Expressions 1.3 and 1.4 clearly show the regular distribution of the additionally induced couplings and their potential impact on the spatial structure of simple-cell receptive fields. It is worth noting that the geometrical organization of inhibitory couplings plays a key role: whereas recurrent inhibitory couplings among clustered populations of neurons induce a regular spatial alternation of excitatory and inhibitory interactions, this behavior does not show up when inhibitory connections are uniformly distributed over the cortical plane.

The functional effects of clustered recurrent inhibition are even more interesting when the structure of the interconnection schemes depends on the orientation selectivity of the target cell. In the following, we focus on inhibitory interactions occurring along the axis orthogonal to the receptive field orientation (Gilbert & Wiesel, 1989; Kisvárday et al., 1994) (see Figs. 2a–d). Accordingly, we consider $k_{fb}$ as
$$k_{fb}(\mathbf{x};\boldsymbol{\xi}) = g_{\varphi_0}(\mathbf{x};\boldsymbol{\xi}) + g_{\varphi_0+\Delta\varphi}(\mathbf{x};\boldsymbol{\xi}) + g_{\varphi_0-\Delta\varphi}(\mathbf{x};\boldsymbol{\xi}), \tag{1.5}$$
where $g_\varphi$ represents a pair of translated ($\pm d$) two-dimensional gaussian functions (with spatial spread $\sigma_k$), rotated through an angle $\varphi(\mathbf{x})$ about the location of the target cell. Specifically, we choose $\varphi_0(\mathbf{x}) = \theta(\mathbf{x}) + \pi/2$ and $\Delta\varphi$ in the range $[0,\, \sigma_k (2 \ln 2)^{1/2}/d]$, the upper limit giving a maximally flat function $k_{fb}$. Figures 2e–h show different resultant interaction fields for various orientation maps. Observe that smooth periodic variations in the orientation preferences are decisive in sustaining the propagation mechanism and thereby priming the periodicity of the induced couplings (see Fig. 2e). Random arrangements of orientations, in contrast, as well as high gradients in the orientation map where inhibition is present, result in a loss of coherence and hence in the attenuation or even the suppression of the contributions of recurrence (see Figs. 2f–h). For all these maps, we observe that although the recurrent kernel $k_{fb}$ has a uniform positive sign, the resultant interaction fields $k_{ff}$ show areas where the couplings are negative in sign. The characteristics of these resultant interaction fields (i.e., the size, the shape, and the sign of interaction) have a complex dependence on the pattern of recurrent connections.

2 Effects on the Cortical Receptive Field Profiles

Since the striate cortex is visuotopically organized, one might expect a relationship between the morphology of lateral connections and the receptive field topography (Mallot, von Seelen, & Giannakopoulos, 1990; Schwarz & Bolz, 1991).
Figure 2: The white blobs in the top quartet show the feedback inhibitory kernels ($k_{fb}$) referred to the central cell of four different distributions of orientation-selective cells. The orientation fields in (a–c) correspond to portions of a single orientation map (derived after Niebur and Wörgötter's model in Niebur & Wörgötter, 1994) and represent different typical situations, such as an iso-orientation domain (a), a pinwheel discontinuity (b), and a fracture discontinuity (c). The map in (d) represents a random arrangement of orientations. The middle quartet (e–h) shows the contour plots of the resulting feedforward kernels after recursion, still referred to the central cell, for the corresponding orientation fields. The plots point out the topology of the inhibitory-excitatory effects of the total afferent drive on the activity of the central cell: gray lines represent positive values of $k_{ff}$ (i.e., inhibitory contributions), whereas black lines represent negative values of $k_{ff}$ (i.e., excitatory contributions). The bottom quartet (i–l) shows the corresponding receptive fields, $h(\cdot, \mathbf{x}')$, obtained by a numerical evaluation of equation 2.3. The parameters used in all the simulations are $\sigma_k = 0.0625$, $\Delta\varphi = \pi/12$, $d = 0.25$, $\sigma_h = 0.1$, and $p = 3$. The inhibition strength $b$ is 0.5.
To model the effect of recurrent inhibition on linear simple-cell receptive fields, we assume the superposition of contributions from the lateral geniculate nucleus (LGN) and from intracortical inhibitory interactions. Specifically, functional projections of LGN afferent neurons provide a weak orientation bias to cortical cells, and the final profiles of the receptive fields are molded by the inhibitory processes inside the cortex. From a computational point of view, assuming the linearity of simple cells (De Angelis, Ohzawa, & Freeman, 1993), the initial excitation $e_0(\mathbf{x})$ is given by
$$e_0(\mathbf{x}) = \int h_0(\mathbf{x};\mathbf{x}')\, i(\mathbf{x}')\, d\mathbf{x}', \tag{2.1}$$
where $i(\mathbf{x})$ is the input signal and $h_0(\mathbf{x};\mathbf{x}')$ is the initial receptive field profile due only to thalamic feedforward connections. The initial receptive field profile is modeled as an oriented gaussian function: $h_0(x_1, x_2) = (1/2\pi p\sigma_h^2)\, \exp\!\left[-\left(x_1^2/p^2 + x_2^2\right)/2\sigma_h^2\right]$, where $p$ refers to the aspect ratio of the field and $\sigma_h$ specifies the receptive field size. Taking intracortical contributions into account, the resulting excitation can be expressed as
$$e(\mathbf{x}) = \int h(\mathbf{x};\mathbf{x}')\, i(\mathbf{x}')\, d\mathbf{x}'. \tag{2.2}$$
From equations 2.1 and 2.2, we can derive the expression for the resulting receptive field after inhibition:
$$h(\mathbf{x};\mathbf{x}') = h_0(\mathbf{x};\mathbf{x}') - b \int k_{ff}(\mathbf{x};\boldsymbol{\xi};b)\, h_0(\boldsymbol{\xi};\mathbf{x}')\, d\boldsymbol{\xi}. \tag{2.3}$$
By numerical evaluation of equation 2.3, we obtained the resultant receptive field profiles for the four orientation fields considered (see Figs. 2i–l). We note that horizontal inhibition may well be the substrate for a variety of influences observed between the receptive field center and its surround, including the refining of the orientation and spatial frequency tuning of simple cells and the emergence of Gabor-like receptive field profiles (Sabatini, 1996). Equation 2.3 indicates how the receptive fields, interpreted as retinotopic distributions of the activity on the cortical plane, reflect the whole spatial pattern of couplings and not only the direct neuroanatomical connectivity (von Seelen, Mallot, & Giannakopoulos, 1987). Therefore, asymmetry and anisotropy of lateral intracortical couplings manifest themselves also in their visuotopic representations, influencing the receptive fields of the cells exposed to the interconnection patterns (Martin & Whitteridge, 1984; Gilbert & Wiesel, 1989).

References

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87.
Chernjavsky, A., & Moody, J. (1990). Spontaneous development of modularity in simple cortical models. Neural Computation, 2, 334–354.
De Angelis, G. C., Ohzawa, I., & Freeman, R. D. (1993). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. II. Linearity of temporal and spatial summation. J. Neurophysiol., 69, 1118–1135.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and cortico-cortical connections in cat visual cortex. J. Neuroscience, 9, 2432–2442.
Kisvárday, Z. F., Kim, D.-S., Eysel, U. T., & Bonhoeffer, T. (1994). Relationship between lateral inhibitory connections and the topography of the orientation map in cat visual cortex. Eur. J. Neurosci., 6, 1619–1632.
Kritzer, M. F., Cowey, A., & Somogyi, P. (1992). Patterns of inter- and intralaminar GABAergic connections distinguish striate (V1) and extrastriate (V2, V4) visual cortices and their functionally specialized subdivisions in the rhesus monkey. J. Neuroscience, 12(11), 4545–4564.
Krone, G., Mallot, H., Palm, G., & Schüz, A. (1986). Spatiotemporal receptive fields: A dynamical model derived from cortical architectonics. Proc. R. Soc. London Biol., 226, 421–444.
Mallot, H. A., von Seelen, W., & Giannakopoulos, F. (1990). Neural mapping and space variant image processing. Neural Networks, 3, 245–263.
Martin, K. A. C., & Whitteridge, D. (1984). The relationship of receptive field properties to the dendritic shape of neurones in the cat striate cortex. J. Physiol., 356, 291–302.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Computation, 6, 602–614.
Sabatini, S. P. (1996). Recurrent inhibition and clustered connectivity as a basis for Gabor-like receptive fields in the visual cortex. Biol. Cybern., 74(3), 189–202.
Somogyi, P. (1989). Synaptic organization of GABAergic neurons and GABA$_A$ receptors in the lateral geniculate nucleus and visual cortex. In D. Man-Kit Lam and C. D. Gilbert (Eds.), Neural Mechanisms of Visual Perception (pp. 35–62). Houston: Portfolio.
von Seelen, W., Mallot, H. A., & Giannakopoulos, F. (1987). Characteristics of neuronal systems in the visual cortex. Biol. Cybern., 56, 37–49.
Winter, D. F. (1990). Integral equations. In C. E. Pearson (Ed.), Handbook of Applied Mathematics (pp. 512–570). New York: Van Nostrand Reinhold.
Wörgötter, F., & Koch, C. (1991). A detailed model of the primary vision pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. J. Neuroscience, 11, 1959–1979.
Received January 4, 1996; accepted August 2, 1996.
Communicated by Geoffrey Goodhill
Possible Roles of Spontaneous Waves and Dendritic Growth for Retinal Receptive Field Development

Pierre-Yves Burgi
Norberto M. Grzywacz
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115 USA
Several models of cortical development postulate that a Hebbian process fed by spontaneous activity amplifies orientation biases occurring randomly in early wiring, to form orientation selectivity. These models are not applicable to the development of retinal orientation selectivity, since they neglect the polarization of the retina’s poorly branched early dendritic trees and the wavelike organization of the retina’s early noise. There is now evidence that dendritic polarization and spontaneous waves are key in the development of retinal receptive fields. When models of cortical development are modified to take these factors into account, one obtains a model of retinal development in which early dendritic polarization is the seed of orientation selectivity, while the spatial extent of spontaneous waves controls the spatial profile of receptive fields and their tendency to be isotropic.
1 Introduction Retinal cells display some functionally mature receptive fields as soon as light responses can be recorded, even prior to birth or eye opening (Masland, 1977; Rusoff & Dubin, 1977; Dacheux & Miller, 1981a, 1981b; Tootle, 1993; Sernagor & Grzywacz, 1995a). In some species, early ganglion cells are even selective to the orientation of image features and the direction of their motion (Masland, 1977; Sernagor & Grzywacz, 1995a). These selectivities may in some sense be genetically specified. However, as it was well expressed by Jacobson (1991), “How essentially the same gene products form a brain [structure] at one position and a spinal cord [another structure] at another place is not given directly by the genome, but eventuates from a network of regulatory epigenetic processes.” When applied to the retina, this reasoning leads to an interest in epigenetic processes that contribute to the retina’s receptive field self-organization. Three observations may be key to the understanding of the development of retinal orientation selectivity. First, in the retina, complex receptive field properties such as orientation selectivity seem to depend on a single synaptic Neural Computation 9, 533–553 (1997)
Second, developing retinas are characterized by early polarization of dendritic trees (Maslim, Webster, & Stone, 1986; Ramoa, Campbell, & Shatz, 1988; Dunlop, 1990; Vanselow, Dütting, & Thanos, 1990), which seems to impart anisotropies to immature receptive fields (Sernagor & Grzywacz, 1995a). Third, multielectrode and optical recording studies in developing retinas of cats and ferrets have revealed that waves of neural activity propagate in random directions across the ganglion cell surface and the IPL (Meister, Wong, Baylor, & Shatz, 1991; Wong, Meister, & Shatz, 1993), correlating the activity of neighbor amacrine and ganglion cells (Wong, Chernjavsky, Smith, & Shatz, 1995). Such a correlation in retinal neighbor cells has also been observed in other species (in rat: Maffei & Galli-Resta, 1990; in turtle: Sernagor & Grzywacz, 1993).

It has been shown that self-organizing multilayer networks with random uncorrelated noise in the first layer, retinotopic connections (with random errors) from layer to layer, and Hebb-like rules for synaptic maturation can lead to the emergence of orientationally selective cells with properties similar to those in the cortex (von der Malsburg, 1973; Linsker, 1986; Yuille, Kammen, & Cohen, 1989; Miller, 1994). We now report the results of substituting retinal-like connections and noise for those used to simulate cortical development. We demonstrate that when Linsker's (1986) model of self-organization of receptive fields is modified to incorporate spontaneous waves (see Fig. 1A) and early dendritic polarization (see Fig. 1B), the spatial extent of the former controls the spatial profile of receptive fields and their tendency to be isotropic, while the latter introduces a bias that helps the development of orientation selectivity. (The material described in this article has appeared in abstract form as Burgi & Grzywacz, 1994a.)
2 Modification of Linsker's Model to Include Waves and Polarized Dendrites

Linsker's model (1986) consists of a multilayer network with feedforward connections. Following MacKay and Miller's (1990) analysis of Linsker's model, we consider two layers (besides, possibly, an input layer). The density of synapses from cells in layer A (for amacrine, in our later discussion) to a given cell in the next ascending layer G (for ganglion) assumes a gaussian distribution as a function of the position of the cells in layer A. Variations in synaptic strength during development are described by a quadratic Hebb-type rule. For a given set of activities in layer A, such as determined by spontaneous uncorrelated noise or waves, this rule expresses the variations in synaptic weight $\omega_i$ of synapse $i$ as a function of its activity $F^A_i$ and the overall activity $F^G$ of the postsynaptic cell. Synaptic maturation is described
Figure 1: Two models of the developing inner plexiform layer (IPL). The ganglion cell dendritic tree, represented as a hatched circle, receives synaptic connections from (A) "dendrite-less" amacrine cells (small circles) or (B) amacrine dendrites (straight lines ending on black squares, which represent the synapses). The position of the synapses is random and follows a gaussian distribution concentric with the ganglion cell's dendritic tree. The dendrodendritic amacrine processes are assumed to originate at the amacrine cell's soma and point toward the ganglion cell's soma. Waves of activity, one of which is shown propagating from top left to bottom right, are modeled by an infinitely extended straight wave front (perpendicular to the direction of propagation) with a gaussian profile of depolarization along that direction. The waves propagate through the IPL and correlate the activity of neighbor cells. This correlation depends on the distance separating two cells, the waves' direction of propagation, and the orientation of the dendritic processes.
by the following dynamical equation (Linsker, 1986):
$$\frac{d}{dt}\,\omega_i = c_1 + \sum_j \left(Q^A_{ij} + c_2\right) D_j\, \omega_j, \tag{2.1}$$
where $Q^A_{ij} = \overline{(F^A_i - \overline{F^A})(F^A_j - \overline{F^A})}$ is the covariance matrix (the overbars denote averages, and $\overline{F^A} = \overline{F^A_i} = \overline{F^A_j}$), $D_j$ is the synaptic density, $c_1 = k - F^G_0(\overline{F^A} - F^A_0)$, and $c_2 = \overline{F^A}(\overline{F^A} - F^A_0)$, with $F^G_0$, $F^A_0$, and $k$ being constants of the Hebbian rule. In this and the next section, we consider the case $c_1 = c_2 = 0$ to compare the outcomes of equation 2.1 for different covariance matrices. In section 4 we analyze the behavior of the system in the $c_1$–$c_2$ plane.

For the original Linsker model, the noise in the layer preceding layer A is uncorrelated and random, and the arbor of connections between these layers is gaussian. Therefore, the covariance function $Q_u$ in layer A has a gaussian form (Linsker, 1986). We define $\sigma_u$ to denote its spatial standard deviation.

For comparison with the Linsker model, we calculated what the covariance function would be if the noise assumed the form of spontaneous waves in layer A instead of uncorrelated noise (see the appendix for details). We recently proposed a biophysical model for the generation and propagation of spontaneous waves of activity in the developing retina (Burgi & Grzywacz, 1994b, 1994c), based on pharmacological results (Sernagor & Grzywacz, 1993). As an approximation to the waves in that model, the waves were modeled here by an infinitely extended straight wave front with a gaussian profile of depolarization along the direction of propagation. In this case, by assuming the synaptic response at each instant to be proportional to the wave-induced depolarization on the amacrine cell, one obtains in the continuum limit (for details, see the appendix)
$$Q_\omega(\vec{r}, \vec{r}\,') = e^{-\Delta^2/8\sigma_\omega^2}\; I_0\!\left(\Delta^2/8\sigma_\omega^2\right), \tag{2.2}$$
where $\vec{r}$ and $\vec{r}\,'$ are two vectors originating at the target cell in layer G (for us, a ganglion cell) and ending at two cells in layer A (for us, amacrine cells), $\Delta = |\vec{r} - \vec{r}\,'|$, $\sigma_\omega$ denotes the spatial standard deviation of the waves, and $I_0(x)$ is the zero-order Bessel function of an imaginary argument.

Finally, we obtained the covariance function for the case of waves exciting the ganglion cell via polarized amacrine dendrites. The dendrite was modeled as a straight process, with a synapse at one end of this process contacting the ganglion cell at a random position in this cell's dendritic tree (see Fig. 1B). For simplicity, an arrow beginning at the middle of the amacrine dendrite and ending at its synapse always pointed toward the center of the ganglion cell's tree. The synaptic response at each instant was taken to be proportional to the integral of the wave-induced depolarization along the length of the dendrite.
The covariance function was computed as the mean covariance due to waves moving in all possible directions. This function is (see the appendix for details)
$$Q_d(\vec{r}, \vec{r}\,') = \int_{|\vec{r}|}^{|\vec{r}|+L} \int_{|\vec{r}\,'|}^{|\vec{r}\,'|+L} \exp\!\left(-\frac{\Delta^2_{x'x''}}{8\sigma_\omega^2}\right) I_0\!\left(\frac{\Delta^2_{x'x''}}{8\sigma_\omega^2}\right) dx'\, dx'', \tag{2.3}$$
where $\vec{r}$ and $\vec{r}\,'$ are vectors originating at the ganglion cell soma and ending at amacrine synapses, $\Delta_{x'x''} = |x'\vec{u}_r - x''\vec{u}_{r'}|$, with $\vec{u}_r$ and $\vec{u}_{r'}$ being unit vectors in the directions of $\vec{r}$ and $\vec{r}\,'$, respectively, and $L$ is the length of the amacrine dendrite.

For the three covariance functions, we assumed the synaptic density function to the ganglion cell to be a gaussian distribution of standard deviation $\sigma_s$, that is,
$$D(\vec{r}\,') = e^{-|\vec{r}\,'|^2/2\sigma_s^2}. \tag{2.4}$$
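The following sketch (ours, not the authors' code) builds discrete versions of these three ingredients on the 17 × 17 lattice mentioned in section 3: the wave covariance $Q_\omega$ of equation 2.2, a coarse quadrature of the dendritic covariance $Q_d$ of equation 2.3, and the synaptic density $D$ of equation 2.4. Parameter values follow the caption of Figure 2.

```python
import numpy as np
from scipy.special import i0e  # i0e(z) = exp(-z) * I0(z), numerically stable

sigma_w, sigma_s, L = 20.0, 50.0, 60.0   # micrometers, as in Figure 2

# 17 x 17 lattice of amacrine positions centered on the ganglion cell soma.
side = np.linspace(-2.0 * sigma_s, 2.0 * sigma_s, 17)
X, Y = np.meshgrid(side, side)
r = np.column_stack([X.ravel(), Y.ravel()])            # (289, 2)

# Equation 2.2: Q_w = exp(-z) I0(z) with z = |r - r'|^2 / (8 sigma_w^2).
diff = r[:, None, :] - r[None, :, :]
Q_w = i0e(np.sum(diff ** 2, axis=-1) / (8.0 * sigma_w ** 2))

# Equation 2.4: gaussian synaptic density of the ganglion cell's arbor.
norms = np.linalg.norm(r, axis=1)
D = np.exp(-norms ** 2 / (2.0 * sigma_s ** 2))

# Equation 2.3: dendritic covariance by a coarse 8 x 8 quadrature along the
# two dendrites (the soma node, with zero unit vector, is treated
# degenerately here; this is a crude sketch, not a production integrator).
units = np.divide(r, norms[:, None], out=np.zeros_like(r),
                  where=norms[:, None] > 0)
s = np.linspace(0.0, 1.0, 8)

def q_d(i, j):
    pts_i = (norms[i] + s * L)[:, None, None] * units[i]   # (8, 1, 2)
    pts_j = (norms[j] + s * L)[None, :, None] * units[j]   # (1, 8, 2)
    z = np.sum((pts_i - pts_j) ** 2, axis=-1) / (8.0 * sigma_w ** 2)
    return i0e(z).mean()                # mean over nodes ~ integral / L^2

# Assembling the full matrix is slow but simple.
Q_d = np.array([[q_d(i, j) for j in range(len(r))] for i in range(len(r))])
```

Up to the constants that the authors drop, these matrices are the inputs to the eigenvector analysis of the next section.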
The distribution in equation 2.4 reflects the dendritic arbor of the ganglion cell.

3 Eigenvector Analysis

To determine what kinds of receptive fields emerge from these models, one must solve equation 2.1. For this purpose, the covariance functions were first discretized on square lattices (size 17 × 17). Next, to obtain all possible outcomes of equation 2.1, we calculated the eigenvalues and eigenvectors of the matrix $(Q^A_{ij} + c_2)D_j$ (the general solution is a linear combination of the
fundamental solution set $\{e^{\lambda_1 t}\vec{u}_1, e^{\lambda_2 t}\vec{u}_2, \ldots, e^{\lambda_n t}\vec{u}_n\}$, where $\lambda_i$ is the eigenvalue corresponding to the eigenvector $\vec{u}_i$ and the coefficients of the linear combination are determined by the initial conditions, plus a term corresponding to a particular solution of the nonhomogeneous equation). Inspection of equations 2.2–2.4, as well as Linsker's covariance function, reveals that the parameter space involves only the spatial spread of the arbor to the amacrine layer in Linsker's model ($\sigma_u$), the waves' width ($\sigma_\omega$), the dendritic length ($L$), and the synaptic density spread ($\sigma_s$). Because these parameters have the dimensions of space, it is possible to study the general behavior of the solutions of these equations with only three dimensionless parameters, by scaling $\sigma_u$, $\sigma_\omega$, and $L$ by $\sigma_s$, the constant common to the three models. To allow a direct comparison between Linsker's model and our two-layer models, we assumed $\sigma_u = \sigma_\omega = \sigma$. Hence, the only free parameters left to explore were $\sigma/\sigma_s$ and $L/\sigma_s$.

To determine the eigenvalues and eigenvectors that solve equation 2.1, we applied the transformation $t_j = D_j^{1/2}\omega_j$, which symmetrizes the
matrix $Q^A_{ij}D_j$ (MacKay & Miller, 1990). The symmetrized matrix was then transformed into a tridiagonal matrix using Householder reduction, and the eigenvalues and eigenvectors were obtained using the QL algorithm (Wilkinson & Reinsh, 1971).

If one assumes (see also MacKay & Miller, 1990) that the synaptic weights must stop growing at some point, then the final receptive fields will tend to be dominated by the eigenvectors with the highest eigenvalues. This domination would not be absolute; variations among cells will occur because of initial random variations in wiring. Initial wiring and termination of development are complex processes, which may depend, for example, on transient gene expression of particular proteins or on immature morphologies of dendritic trees. Consequently, to avoid obscuring the main points of the article, we decided to omit the detailed mathematical and computational analyses of the outcome given various initial conditions and termination rules. Accordingly, our results, expressed in terms of eigenvalues and eigenvectors, should be interpreted only as tendencies for symmetry breaking of the different covariance functions, and not as final incidences of various cell types.

The principal eigenfunctions in all three models (for $c_1 = c_2 = 0$) were found to be circularly symmetric, and symmetry breaking could occur only if the next eigenfunction in line dominated by random chance (see Fig. 2). To measure the affinity of the network for symmetry breaking, we considered the ratio between the largest and second-largest eigenvalues. This ratio was high for $Q_\omega$ (in comparison to Linsker's original model; see Fig. 2A). The ratio became even higher as we widened the waves. The physical reason for this tendency is that wide waves tend to correlate the activity of many amacrine cells, thus averaging out local statistical orientational biases that might exist by chance in the retinal wiring. The ratio between the two largest eigenvalues was smaller when dendrites were present than when they were not. Moreover, the symmetrical center-surround eigenvector that was in the second position for $Q_u$ and $Q_\omega$ was relegated to the fourth position for $Q_d$, behind two anisotropic eigenfunctions (see Fig. 2C). Taken together, these two changes indicate that dendritic polarization biases the system toward symmetry breaking, facilitating the formation of orientationally selective cells.

Although the three sets of eigenfunctions corresponding to $Q_u$, $Q_\omega$, and $Q_d$ resemble each other visually (see Fig. 2), they are not identical. For Linsker's case, the eigenfunctions of $Q_u$ have been derived analytically by MacKay and Miller (1990). Although we could not do so for $Q_\omega$ and $Q_d$ due to the Bessel function, we could verify that the analytic expressions of their eigenfunctions differ from those of $Q_u$. It is possible to see these differences reflected in the eigenfunctions. For instance, the main eigenfunctions are flatter for waves (the larger central white area) than for uncorrelated noise. Similarly, changes between positive and negative regions of the other eigenfunctions are faster for uncorrelated noise than for waves. The wave eigenfunctions are flatter because of the "infinite" extent of the waves' fronts, which coordinates cells over large distances.
Figure 2: Principal eigenfunctions of the covariance functions. The three cases illustrated in this figure correspond to the original Linsker model with uncorrelated noise (A), to a modification of that model, substituting spontaneous waves for uncorrelated noise (B), and to a modification of that model using both waves and polarized dendritic inputs (C). In each column, the eigenfunctions have the same eigenvalue, the column with the largest eigenvalue being at the left. The ratio between every two successive eigenvalues is indicated by the number between every two columns. The gray-scale bar is proportional to synaptic strength. Each eigenfunction has been normalized to use the full gray scale. Parameters were $\sigma_\omega = 20\ \mu$m (Burgi & Grzywacz, 1994c), $\sigma_u = 20\ \mu$m, $\sigma_s = 50\ \mu$m, and $L = 60\ \mu$m (Kolb, 1982). While the left-most eigenfunctions represent concentric receptive fields that will be favored by the model, the next receptive fields in line are orientationally selective. Other simulations with $\sigma_\omega$, $\sigma_s$, and $L$ ranges of 12–100 µm, 10–80 µm, and 25–200 µm, respectively, were performed with similar results. In comparison to development with waves only, development with dendritic polarization helps symmetry breaking. Continued following page.
How variations in $\sigma_\omega/\sigma_s$ and amacrine dendritic length affect the tendency for symmetric eigenfunctions (estimated by $\lambda_1/\lambda_2$) is shown in Figure 3. Large values of $\sigma_\omega/\sigma_s$, corresponding to either wide waves or narrow synaptic density distributions, were found to favor concentric cells (see Fig. 3A).
Figure 2: Continued.
This result is consistent with wide waves (wide with respect to $\sigma_s$) tending to synchronize the activity of large groups of cells whatever the waves' directions of propagation. A similar interpretation can be given for uncorrelated random noise, where a wide arbor of connections between layer A and the layer preceding it smoothes out local statistical biases for orientation selectivity present in the wiring, leading to more symmetry (see Fig. 3A). The effect of dendritic length on symmetry breaking is shown in Figure 3B. The value of $\lambda_1/\lambda_2$ decreases as dendritic length increases, with the largest variations for lengths smaller than or comparable to the spatial spread of the waves. Therefore, larger dendritic lengths help symmetry breaking. For very small dendritic lengths, $\lambda_1/\lambda_2$ becomes equivalent to the case without dendrites.
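For readers who want to reproduce the $\lambda_1/\lambda_2$ diagnostic, here is a minimal sketch (our addition) of the symmetrization described in section 3; numpy's symmetric eigensolver plays the role of the Householder/QL routines cited from Wilkinson and Reinsh (1971), and the matrices can be taken from the earlier sketch:

```python
import numpy as np

def eig_ratio(Q, D):
    """Ratio of the two largest eigenvalues of Q D (the c1 = c2 = 0 case).

    The substitution t_j = sqrt(D_j) * w_j turns Q D into the symmetric
    matrix D^(1/2) Q D^(1/2), which has the same eigenvalues.
    """
    S = np.sqrt(D)
    M = S[:, None] * Q * S[None, :]
    evals = np.linalg.eigvalsh(M)      # returned in ascending order
    return evals[-1] / evals[-2]

# Example with the matrices from the previous sketch:
# print(eig_ratio(Q_w, D), eig_ratio(Q_d, D))
```

A large ratio means the circularly symmetric principal eigenfunction dominates robustly; a ratio near 1 means random initial conditions can tip development toward the anisotropic runner-up, that is, toward symmetry breaking.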
Figure 3: Effects of parameter variations on symmetry breaking. (A) The tendency for isotropy, expressed as $\lambda_1/\lambda_2$, is plotted against the spatial spread ($\sigma_u$) of the arbor at the input to the amacrine cells (for the original Linsker model) or the spatial spread ($\sigma_\omega$) of the waves along the direction of their propagation (for the models substituting waves for uncorrelated noise). These plots are for either the presence or the absence of polarized dendritic inputs ($L/\sigma_s = 1.2$) to the ganglion cell. (B) The ratio $\lambda_1/\lambda_2$ is plotted against dendritic length for $\sigma_\omega/\sigma_s = 0.4$. The results show that symmetry breaking becomes more likely as the size of the amacrine dendrites grows and as the spreads of the waves and of the arbor at the input to the amacrine cells diminish.
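The trend in Figure 3A for the wave covariance can be reproduced end to end with a few lines (our sketch; it simply combines the two previous ones into a single self-contained scan):

```python
import numpy as np
from scipy.special import i0e

def ratio_for(sigma_w, sigma_s=50.0):
    # Wave covariance (eq. 2.2) and density (eq. 2.4) on a 17 x 17 lattice,
    # then the largest-to-second eigenvalue ratio of D^(1/2) Q D^(1/2).
    side = np.linspace(-2.0 * sigma_s, 2.0 * sigma_s, 17)
    X, Y = np.meshgrid(side, side)
    r = np.column_stack([X.ravel(), Y.ravel()])
    diff = r[:, None, :] - r[None, :, :]
    Q = i0e(np.sum(diff ** 2, axis=-1) / (8.0 * sigma_w ** 2))
    D = np.exp(-np.sum(r ** 2, axis=1) / (2.0 * sigma_s ** 2))
    S = np.sqrt(D)
    evals = np.linalg.eigvalsh(S[:, None] * Q * S[None, :])
    return evals[-1] / evals[-2]

for sw in (10.0, 20.0, 40.0, 80.0):    # wave widths in micrometers
    print(sw, round(ratio_for(sw), 2))
```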
Finally, how variations in the waves' width and amacrine dendritic length affect the shape of the eigenfunctions is illustrated in Figure 4. For waves of spatial spread four times wider than those used in Figure 2B, we found the same set of eigenfunctions, but more spatially extended (roughly 2.5 times more extended; see Fig. 4A). Consequently, the spatial extent of the waves may be important not only in controlling the tendency to form concentric receptive fields but also in contributing to the fields' spatial extent. (In support, there is now evidence that spontaneous waves are key to the determination of receptive field size during retinal development; Sernagor & Grzywacz, 1994, 1995a, 1995b, 1996.) In turn, although increases in receptive field size were small when dendritic length was doubled (see Fig. 4B), the order of importance of the eigenfunctions changed. The center-surround eigenfunction was removed from the set of the first six eigenfunctions (see Fig. 2C), being replaced by anisotropic eigenfunctions with three angular nodes, reminiscent of immature receptive fields with multi-axis anisotropy as observed in embryonic turtle retina (Sernagor & Grzywacz, 1995a).

4 Analysis of Behavior in the c1 − c2 Plane

The parameters $c_1$ and $c_2$ depend on the choice of the Hebbian thresholds ($F^G_0$ and $F^A_0$) and the average activity in layer A ($\overline{F^A}$). (In our models, this average activity is fixed by the waves' properties, that is, their spatial extent, speed, and frequency of occurrence.) While parameter $c_2$ affects the eigenvalues of the matrix $(Q^A_{ij} + c_2)D_j$, parameter $c_1$ moves the fixed point of the dynamics with respect to the origin. Because $Q^A_{ij} = \overline{(F^A_i - \overline{F^A})(F^A_j - \overline{F^A})}$, the entries of this matrix depend on both its spatial properties (determined by $i - j$) and the overall level of activity (determined by $\overline{F^A}$). But although the effects of the spatial properties of $Q^A_{ij}$ are the central theme of this article, the square of the overall level of activity, which is proportional to the entries of $Q^A_{ij}$, has the same dimensions as $c_2$. Therefore, it is possible to study the general behavior of the eigenvalues of $(Q^A_{ij} + c_2)D_j$ with a single dimensionless parameter by scaling $c_2$ by the maximum of $Q^A_{ij}$. Because this scaling is equivalent to normalizing the maximum of $Q^A_{ij}$ ($Q^A_{ii}$) to 1, in effect we will be considering the correlation matrices instead of the covariance ones.

The parameter $c_1$ does not affect the eigenvalues of $(Q^A_{ij} + c_2)D_j$. Rather, it affects the coefficients of the linear combination of $\{e^{\lambda_1 t}\vec{u}_1, e^{\lambda_2 t}\vec{u}_2, \ldots, e^{\lambda_n t}\vec{u}_n\}$ in the solution of equation 2.1. The effect on these coefficients is such that if the magnitude of $c_1$ is large, then it gives a significant head start to the all-excitatory (all-inhibitory) profiles. Feng, Pan, and Roychowdhury (1995) analyzed how this head start would affect the outcome of the evolution of equation 2.1 if the synaptic weights were constrained to lie between fixed minimum and maximum values.
Figure 4: Effects of parameter variations on eigenfunctions. Parameters are the same as in Figure 2, except that in (A) the waves' width is 80 µm and in (B) the amacrine dendritic length is 120 µm. Comparison of A with Figure 2B shows that the width of the receptive field grows with the width of the waves. Comparison of B with Figure 2C shows that the width of the receptive field grows with the length of the amacrine dendrites.
These authors reached the conclusion that the stable fixed-point solutions to the equation would lie in four domains where receptive-field profiles are (1) all excitatory, (2) all inhibitory, (3) all excitatory or all inhibitory, or (4) of various shapes, including anisotropic ones. The first two domains occur when $c_1$ is large and the latter two when it is close to zero. It is possible that large values of $c_1$ are relevant for retinas that have a majority of concentric ganglion cells, like the primate and cat retinas. (In section 5, we will address how surround inhibition may be added to all-excitatory profiles.) However, if $c_1$ is large, there is nothing more to discuss; receptive fields will be isotropic. Consequently, in the rest of this section, we will focus on the cases $c_2 > 0$ and $c_2 < 0$ for $c_1 = 0$.

The Case $c_2 > 0$. As stated by MacKay and Miller (1990), the eigenvectors affected by $c_2$ are those with nonzero DC components, that is, the all-excitatory (all-inhibitory) and the center-surround connection patterns (the strongest and weakest profiles in Fig. 2C, respectively). For $c_2$ large and positive, MacKay and Miller (1990) have observed that $Q^A_{ij} + c_2$ can be treated as
a first-order perturbation to the matrix $c_2 J$, where $J$ is the matrix with $J_{ij} = 1$. This
matrix has only one nonzero-DC eigenvector with positive eigenvalue: the vector $\vec{n}$, $n_i = 1$, which is symmetric. Consequently, the expectation is that as $c_2$ increases, so does the tendency to develop symmetric receptive fields. A mathematical proof (MacKay & Miller, 1990) and computer simulations (see Fig. 5) confirm this expectation for the three cases we are considering. When $c_2$ increases from zero to a large and positive value, we find that the eigenvalues (tendency) of both nonzero-DC, symmetrical profiles rise in comparison to those of the non-DC, orientationally selective ones. In addition, the simulations show that the tendency is for symmetrical profiles to rise faster for the uncorrelated-noise case than for the spontaneous-waves cases. Mathematically, this result follows from the Bessel function in equations 2.2 and 2.3 being a slowly decreasing function, which thus perturbs $c_2 J$ more than a pure gaussian function would. Physically, this result is probably a consequence of the anisotropic shape of the waves imparting more orientation selectivity onto the cells. We make this conjecture because as one widens the waves and makes them more isotropic, the receptive fields themselves become more isotropic (see Figs. 3A and 4A).

The Case $c_2 < 0$. For $c_2$ large and negative, when one performs the perturbation analysis, one finds that $c_2 J$ has anisotropic eigenvectors with eigenvalues equal to zero and that its only nonzero-DC, symmetric eigenvector, $\vec{n}$, has a negative eigenvalue, which is proportional to $c_2$. As a result, the eigenvalue of the all-excitatory (all-inhibitory) eigenvector becomes negative, making this receptive field essentially irrelevant (MacKay & Miller, 1990). Computer simulations also reveal that the eigenvalue of the center-surround eigenvector increases (and that of all other eigenvectors with nonzero-DC components) as $c_2$ decreases from zero to a large and negative value.
Figure 5: Effects of the parameter $c_2$ on eigenfunctions. All-excitatory (all-inhibitory) profiles are schematically represented by circles; center-surround profiles by black annuli; one-axis anisotropy (second column of Fig. 4B) by half-white–half-black disks; and two-axis anisotropy (third column of Fig. 4B) by disks bisected by lines. The ordering of the profiles is like that in Figures 2 and 4, except that the receptive fields here are arranged vertically, with the most important ones at the top. The parameters $\sigma_u$, $\sigma_\omega$, $\sigma_s$, and $L$ are the same as in Figure 2. For large $|c_2|$, uncorrelated noise favors symmetrical receptive fields in comparison to spontaneous waves. In addition, this figure illustrates that the bias toward symmetry breaking resulting from dendritic polarization is independent of the choice of $c_2$.
This result was first observed by MacKay and Miller (1990), who pointed out in their Theorem 2 that as $c_2$ varies, the eigenvalues of the nonzero-DC eigenvectors "increase continuously and monotonically with ($c_2$) between asymptotic limits such that the upper limit of one eigenvalue is the lower limit of the eigenvalue above." Thus, as $c_2 \to -\infty$, the eigenvalue of the all-excitatory (all-inhibitory) profile falls toward the upper limit of the center-surround profile; similarly, the eigenvalue of the center-surround profile also falls. However, as the original all-excitatory (all-inhibitory) profile falls, it is transformed continuously into a center-surround profile, making it appear as though the eigenvalue of the center-surround eigenvector increases (MacKay & Miller, 1990).

Qualitatively, the fall of the all-excitatory (all-inhibitory) profile and the apparent rise of the center-surround profile as $c_2$ becomes negative occur for all three covariance matrices (see Fig. 5). But for $Q_\omega$ and $Q_d$, the rise of the center-surround eigenvector is slower than for $Q_u$. As before, the mathematical explanation for this difference stems from the wider spatial extent of $Q_\omega$ and $Q_d$ resulting from their Bessel-function component. Again, physically, the preference of $Q_\omega$ and $Q_d$ for anisotropic profiles may be due to the anisotropic shape of the waves.

5 Discussion

5.1 Possible Roles for Spontaneous Waves of Activity. Previous models for self-organization of orientation selectivity in cortex assumed uncorrelated random noise on the first layer of the network (von der Malsburg, 1973; Linsker, 1986; Yuille et al., 1989; Stetter, Lang, & Müller, 1993). This assumption is not applicable to the developing retina because there is evidence for waves of spontaneous discharges propagating across the retinal surface and, in the process, correlating the activity of neighboring cells (in cat and ferret: Meister et al., 1991; Wong et al., 1993, 1995; in rat: Maffei & Galli-Resta, 1990; in turtle: Sernagor & Grzywacz, 1993, 1995a). This correlation is observed during the period of retinal synaptogenesis (Maslim & Stone, 1986; Horsburgh & Sefton, 1987; De Juan, Grzywacz, Guardiola, & Sernagor, 1996) and of dendritic growth and remodeling (Wong, Herrmann, & Shatz, 1991; De Juan, Grzywacz, Guardiola, & Sernagor, 1995). Consequently, one must ponder the role that these spontaneous correlating waves of activity may have in the formation of retinal receptive fields.

We modified one of the cortical models (Linsker's) to include waves and compared the tendencies for formation of particular types of receptive fields in the new and old models through an analysis of their covariance matrices. In this analysis, the likely types of receptive fields were described by these matrices' eigenvectors with highest eigenvalues, and the tendency of each type of receptive field to develop was estimated by the eigenvalues. The main eigenfunctions of the Linsker and wave-modified covariance matrices were found to have similar profiles. These profiles were purely
excitatory concentric, orientationally selective, and excitatory-center–inhibitory-surround concentric. Wide waves tend to favor concentric receptive fields more than narrow waves do (see Fig. 3A). The physical reason for the isotropy tendency of wide waves in the model is that their wide extent tends to correlate the activity of many amacrine cells, thus averaging out local statistical orientational biases that might exist by chance in the retinal wiring. Whether this averaging really favors isotropy may be debatable, since averaging also weakens the tendency for the center-surround profile (see Fig. 4A). In Figure 3, we used the all-excitatory profile to quantify isotropy, because in the retina, surround inhibition appears early in development, even before mature concentric or orientationally selective receptive fields emerge (Sernagor & Grzywacz, 1995a). This suggests that perhaps the inhibitory surround of center-surround profiles might be determined genetically, independent of spontaneous activity. If this suggestion is incorrect and the emergence of the surround depends on spontaneous activity, then $c_2$ must be negative to eliminate the all-excitatory profile (see Fig. 5). In this case, the fact that the early retinal noise comes in the form of waves would favor the formation of orientation selectivity. As discussed in section 4, this bias toward anisotropic profiles may be due to the anisotropic shape of the waves.

Another possible role for waves is contributing to the size of receptive fields. Waves tend to yield wider receptive fields than uncorrelated noise because of the large lateral extent of the wave fronts. In addition, wider waves tend to give rise to wider receptive fields (see Fig. 4A). This coupling between wave width and receptive field size could have implications for explaining receptive field size differences across species.

Recent data are strongly supportive of the role of waves for both the formation of concentric receptive fields and the control of their size. Normally, in turtle, the wavelike activity lasts until about postnatal day 21 (P21) and then disappears (Sernagor & Grzywacz, 1995a). Coincident with the disappearance of waves, the receptive field sizes mature (Sernagor & Grzywacz, 1995a). However, if one dark-rears the turtles, the wavelike activity is stronger and longer lasting (>P40), and the receptive fields become larger and more concentric (Sernagor & Grzywacz, 1994). Furthermore, the density of growth cones (and thus of dendritic growth) increases under dark rearing (De Juan et al., 1995). In turn, if one blocks the waves chronically by implanting in the retina curare-soaked Elvax (Elvax is an ethylene vinyl-acetate copolymer) (Sernagor & Grzywacz, 1995b, 1996), the receptive fields stop growing and become less concentric. Although there are other explanations for the changes in receptive field size and concentricity with dark rearing and curare implantation, these changes are generally consistent with our model.

5.2 Possible Roles for Early Dendritic Polarization. To mimic the polarization of poorly branched dendritic trees of immature cells (Maslim et al., 1986; Ramoa et al., 1988; Dunlop, 1990; Vanselow et al., 1990), we modified the original Linsker model to include (waves and) amacrine dendritic
processes pointing toward the ganglion cell's soma and making contact with its dendritic tree. We then applied the covariance-matrix analysis to this case. The ratio between the two largest eigenvalues of the resulting covariance matrix was found to be smaller than for the model with only waves (see Fig. 3A), and this effect was more pronounced for longer dendrites (see Fig. 3B). Moreover, for long dendrites, the eigenfunction order changed, as the center-surround concentric eigenfunctions were relegated to a less dominant position than anisotropic ones (see Figs. 2C and 4B). These results indicate that dendritic polarization biases the system toward orientation selectivity. Such a bias was also suggested by Sernagor and Grzywacz (1995b), who found physiological correlates of dendritic polarization in immature retinas.

Appendix

This appendix shows the main steps of the derivation of equations 2.2 and 2.3. We want to determine how the activity of two amacrine cells is correlated by spontaneous waves propagating through the amacrine layer. Waves were modeled by an infinitely extended straight wave front and a gaussian profile (of standard deviation $\sigma_\omega$) along the direction of propagation. The synaptic response at each instant was taken to be proportional to the value of the wave on the amacrine cell, that is, $F^A(\vec{r}) \sim \exp\!\left(-(\vec{r}\cdot\vec{u}_\omega + x)^2/2\sigma_\omega^2\right)$, where $\vec{r}$ is the vector originating at the ganglion cell soma and ending at the amacrine cell, $\vec{u}_\omega$ is the unit vector in the direction of the wave's propagation, and $x$ is the shortest distance between the ganglion cell soma and the wave's crest (see Fig. 6). The correlation between two amacrine cells can thus be expressed as $Q(\vec{r},\vec{r}\,') = \overline{F^A(\vec{r})\,F^A(\vec{r}\,')}$, where the overbar denotes an average over a large number of waves and the prime refers to another amacrine cell. The correlation was computed as the mean correlation due to waves moving in all possible directions and spatial locations,
$$Q(\vec{r},\vec{r}\,') = \int_{-\infty}^{\infty}\int_{-\pi}^{\pi} F^A(\vec{r})\, F^A(\vec{r}\,')\, d\theta\, dx, \tag{A.1}$$
where we are omitting the normalization factor and other constants from the analysis, as they only scale the eigenvectors. Developing equation A.1 and taking into account the waves' symmetry, we get
$$Q(\vec{r},\vec{r}\,') = \int_{-\infty}^{\infty}\int_{-\pi}^{\pi} \exp\!\left(-\frac{(|\vec{r}|\cos(\alpha-\theta)+x)^2}{2\sigma_\omega^2}\right) \exp\!\left(-\frac{(|\vec{r}\,'|\cos(\alpha'-\theta)+x)^2}{2\sigma_\omega^2}\right) d\theta\, dx, \tag{A.2}$$
where $\alpha$ and $\alpha'$ are the vector orientations of $\vec{r}$ and $\vec{r}\,'$, respectively, and $\theta$ is the wave's direction of propagation. After arranging the integrand as an exponential of a perfect square, the integral in $x$ is straightforward and gives $\sigma_\omega\sqrt{\pi/2}$.
Figure 6: Schematic representation of the variables used to determine the wave's covariance functions. Two vectors $\vec{r}$ and $\vec{r}\,'$ (dashed arrows) originate at the ganglion cell soma (shown at the origin) and end at two amacrine synapses (represented by filled circles). The orientations of these vectors are represented by the unit vectors $\vec{u}_r$ and $\vec{u}_{r'}$. A wave propagates with a direction represented by the unit vector $\vec{u}_\omega$ and has its peak amplitude (represented by a bold, straight line) at a distance $x$ from the ganglion cell. The amacrine dendrites are shown as solid lines of length $L$.
As constants are omitted from the analysis, equation A.2 becomes
$$Q(\vec{r},\vec{r}\,') \sim \int_{-\pi}^{\pi} \exp\!\left(-\frac{\left(|\vec{r}|\cos(\alpha-\theta) - |\vec{r}\,'|\cos(\alpha'-\theta)\right)^2}{4\sigma_\omega^2}\right) d\theta. \tag{A.3}$$
Using trigonometric manipulations, it can be shown that
$$\left(|\vec{r}|\cos(\alpha-\theta) - |\vec{r}\,'|\cos(\alpha'-\theta)\right)^2 = |\vec{r}-\vec{r}\,'|^2 \cos^2(\theta-\gamma), \tag{A.4}$$
where $\gamma = \tan^{-1}\!\left[(|\vec{r}|\sin\alpha - |\vec{r}\,'|\sin\alpha')/(|\vec{r}|\cos\alpha - |\vec{r}\,'|\cos\alpha')\right]$. Using this definition and the new variable $\phi = \theta - \gamma$ (the integration limits do not change, as $\gamma$ is an arbitrary angle), equation A.3 can be integrated with respect to $\phi$ (Gradshteyn & Ryzhik, 1980) to yield
$$Q(\vec{r},\vec{r}\,') \sim \exp\!\left(-\frac{|\vec{r}-\vec{r}\,'|^2}{8\sigma_\omega^2}\right) I_0\!\left(\frac{|\vec{r}-\vec{r}\,'|^2}{8\sigma_\omega^2}\right), \tag{A.5}$$
where $I_0(x)$ is the zero-order Bessel function of an imaginary argument.
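This closed form is easy to check numerically (a sketch of our own; the test positions are arbitrary). Averaging the product of wave responses over directions and offsets, as in equation A.1, should reproduce equation A.5 up to a constant factor that is independent of $\vec{r}$ and $\vec{r}\,'$:

```python
import numpy as np
from scipy.special import i0

sigma_w = 1.0
r1 = np.array([0.6, 0.2])    # arbitrary test positions (hypothetical values)
r2 = np.array([-0.3, 0.5])

# Direct evaluation of equation A.1 on a fine grid of directions and offsets.
theta = np.linspace(-np.pi, np.pi, 721)
x = np.linspace(-8.0 * sigma_w, 8.0 * sigma_w, 801)
T, Xg = np.meshgrid(theta, x)
u = np.stack([np.cos(T), np.sin(T)], axis=-1)   # wave propagation directions
F1 = np.exp(-(u @ r1 + Xg) ** 2 / (2.0 * sigma_w ** 2))
F2 = np.exp(-(u @ r2 + Xg) ** 2 / (2.0 * sigma_w ** 2))
direct = np.sum(F1 * F2)

# Closed form of equation A.5, with the omitted constants set to 1.
z = np.sum((r1 - r2) ** 2) / (8.0 * sigma_w ** 2)
closed = np.exp(-z) * i0(z)

# The ratio equals the omitted constant and should not depend on r1, r2.
print(direct / closed)
```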
For the calculation of the covariance function for waves hitting polarized dendrites, the dendrite was modeled as a straight process with a dendrodendritic synapse contacting the ganglion cell at a random position in this cell's gaussian dendritic tree. An arrow beginning at the middle of the dendrite and ending at the synapse always pointed toward the middle of the ganglion cell tree. The synaptic response at each instant was taken to be proportional to the integral of the values of the wave along the length of the dendrite, that is,
$$F^A(\vec{r}) \sim \int_{|\vec{r}|}^{|\vec{r}|+L} \exp\!\left(-\frac{(\vec{u}_r\cdot\vec{u}_\omega\, x' + x)^2}{2\sigma_\omega^2}\right) dx', \tag{A.6}$$
where $\vec{u}_r$ is a unit vector in the direction of $\vec{r}$ and $L$ is the dendritic length (see Fig. 6). As before, the correlation between the activity of two amacrine cells, $F^A(\vec{r})$ and $F^A(\vec{r}\,')$, was computed as the mean correlation due to waves moving in all possible directions:
$$Q(\vec{r},\vec{r}\,') = \int_{-\infty}^{\infty} dx \int_{-\pi}^{\pi} d\theta \int_{|\vec{r}|}^{|\vec{r}|+L} \exp\!\left(-\frac{(\cos(\alpha-\theta)\,x' + x)^2}{2\sigma_\omega^2}\right) dx' \times \int_{|\vec{r}\,'|}^{|\vec{r}\,'|+L} \exp\!\left(-\frac{(\cos(\alpha'-\theta)\,x'' + x)^2}{2\sigma_\omega^2}\right) dx''. \tag{A.7}$$
Q(Er, Er ) ∼
Z
|Er|+L Z |Er0 |+L
|Er|
|Er0 |
Ã
120 00 exp − x x2 8σω
!
à I0
12x0 x00 8σω2
! dx0 dx00
(A.8)
where $\Delta_{x'x''} = |x'\vec{u}_r - x''\vec{u}_{r'}|$.

Acknowledgments

We are grateful to Kenneth D. Miller and Anthony M. Norcia for their fruitful comments on an earlier version of this article. This work was supported by a grant from the Swiss National Fund for Scientific Research (8220-37180) to P.-Y.B., by grants from the National Eye Institute (EY-08921) and the U.S. Office of Naval Research (N00014-91-J-1280), by the William A. Kettlewell chair to N.M.G., and by a core grant from the National Eye Institute to Smith-Kettlewell (EY-06883).
References

Burgi, P.-Y., & Grzywacz, N. M. (1994a). Does prenatal development of ganglion-cell receptive fields depend on spontaneous activity and Hebbian processes? Invest. Ophthalmol. Vis. Sci., 35, 2126.
Burgi, P.-Y., & Grzywacz, N. M. (1994b). Model based on extracellular potassium for spontaneous synchronous activity in developing retinas. Neural Computation, 6, 981–1002.
Burgi, P.-Y., & Grzywacz, N. M. (1994c). Model for the pharmacological basis of spontaneous synchronous activity in developing retinas. J. Neurosci., 14, 7426–7439.
Dacheux, R. F., & Miller, R. F. (1981a). An intracellular electrophysiological study of the ontogeny of functional synapses in the rabbit retina. I. Receptors, horizontal and bipolar cells. J. Comp. Neurol., 198, 307–326.
Dacheux, R. F., & Miller, R. F. (1981b). An intracellular electrophysiological study of the ontogeny of functional synapses in the rabbit retina. II. Amacrine cells. J. Comp. Neurol., 198, 327–334.
De Juan, J., Grzywacz, N. M., Guardiola, J. V., & Sernagor, E. (1995). Anatomical and physiological dark-rearing induced changes in turtle retinal development. Invest. Ophthalmol. Vis. Sci., 36, 60.
De Juan, J., Grzywacz, N. M., Guardiola, J. V., & Sernagor, E. (1996). Coincidence of synaptogenesis and emergence of spontaneous and light-evoked activity in embryonic turtle retina. Invest. Ophthalmol. Vis. Sci., 37, 634.
Dowling, J. E. (1987). The retina. Cambridge, MA: Harvard University Press.
Dunlop, S. A. (1990). Early development of retinal ganglion cell dendrites in the marsupial Setonix brachyurus, quokka. J. Comp. Neurol., 293, 425–447.
Feng, J., Pan, H., & Roychowdhury, V. P. (1995). Linsker-type Hebbian learning: A quantitative analysis on the parameter space (Tech. Rep. No. TR-EE 95-12). Purdue University.
Gradshteyn, I. S., & Ryzhik, I. M. (1980). Table of integrals, series, and products. Orlando, FL: Academic Press.
Horsburgh, G. M., & Sefton, A. J. (1987). Cellular degeneration and synaptogenesis in the developing retina of the rat. J. Comp. Neurol., 263, 553–566.
Jacobson, M. (1991). Developmental neurobiology. New York: Plenum Press.
Kolb, H. (1982). The morphology of the bipolar cells, amacrine cells and ganglion cells in the retina of the turtle Pseudemys scripta elegans. Philos. Trans. R. Soc. B, 298, 355–393.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. USA, 83, 8390–8394.
MacKay, D. J. C., & Miller, K. D. (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1, 257–297.
Maffei, L., & Galli-Resta, L. (1990). Correlation in the discharges of neighboring rat retinal ganglion cells during prenatal life. Proc. Natl. Acad. Sci. USA, 87, 2861–2864.
Masland, R. H. (1977). Maturation of function in the developing rabbit retina. J. Comp. Neurol., 175, 275–286.
Maslim, J., & Stone, J. (1986). Synaptogenesis in the retina of the cat. Brain Research, 373, 35–48.
Maslim, J., Webster, M., & Stone, J. (1986). Stages in the structural differentiation of retinal ganglion cells. J. Comp. Neurol., 373, 35–48.
Meister, M., Wong, R. O. L., Baylor, D. A., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14, 409–441.
Ramoa, A. S., Campbell, G., & Shatz, C. J. (1988). Dendritic growth and remodelling of cat retinal ganglion cells during fetal and postnatal development. J. Neurosci., 8, 4239–4261.
Rusoff, A. C., & Dubin, M. W. (1977). Development of receptive-field properties of retinal ganglion cells in kittens. J. Neurophysiol., 40, 1188–1198.
Sernagor, E., & Grzywacz, N. M. (1993). Cellular mechanisms underlying spontaneous correlated activity in the turtle embryonic retina. Invest. Ophthalmol. Vis. Sci., 34, 1156.
Sernagor, E., & Grzywacz, N. M. (1994). Role of early spontaneous activity and visual experience in shaping complex receptive field properties of ganglion cells in the developing retina. Neurosci. Abst., 20, 1470.
Sernagor, E., & Grzywacz, N. M. (1995a). Emergence of complex receptive field properties of ganglion cells in the developing turtle retina. J. Neurophysiol., 73, 1355–1364.
Sernagor, E., & Grzywacz, N. M. (1995b). Shaping of receptive field properties in developing retinal ganglion cells in the absence of early cholinergic spontaneous activity. Neurosci. Abst., 21, 1503.
Sernagor, E., & Grzywacz, N. M. (1996). Influence of spontaneous activity and visual experience on developing retinal fields. Current Biology, 6, 1503–1508.
Stetter, M., Lang, E. W., & Müller, A. (1993). Emergence of orientation selective simple cells simulated in deterministic and stochastic neural networks. Biol. Cybern., 68, 465–476.
Tootle, J. S. (1993). Early postnatal development of visual function in ganglion cells of the cat retina. J. Neurophysiol., 69, 1645–1660.
Vanselow, J., Dütting, D., & Thanos, S. (1990). Target dependence of chick retinal ganglion cells during embryogenesis: Cell survival and dendritic development. J. Comp. Neurol., 295, 235–247.
von der Malsburg, C. (1973). Self organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Wilkinson, J. H., & Reinsh, C. (1971). Linear algebra. New York: Springer-Verlag.
Wong, R. O. L., Herrmann, K., & Shatz, C. J. (1991). Remodeling of retinal ganglion cell dendrites in the absence of action potential activity. J. Neurobiol., 22, 685–697.
Wong, R. O. L., Meister, M., & Shatz, C. J. (1993). Transient period of correlated bursting activity during development of the mammalian retina. Neuron, 11, 923–938.
Spontaneous Waves and Dendritic Growth
553
Wong, R. O. L., Chernjavsky, A., Smith, S. J., & Shatz, C. J. (1995). Early functional neural networks in the developing retina. Nature, 374, 716–718. Yuille, A. L., Kammen, D. M., & Cohen, D. (1989). Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern., 61, 183–194.
Received November 14, 1995; accepted July 9, 1996.
Communicated by Klaus Pawelzik
Singularities in Primate Orientation Maps

K. Obermayer
Fachbereich Informatik, TU Berlin, 10587 Berlin, Germany

G. G. Blasdel
Department of Neurobiology, Harvard Medical School, Boston, MA 02215, USA
We report the results of an analysis of orientation maps in primate striate cortex with a focus on singularities and their distribution. Data were obtained from squirrel monkeys and macaque monkeys of different ages. We find that approximately 80% of singularities that are nearest neighbors have the opposite sign and that the spatial distribution of singularities differs significantly from a random distribution of points. We do not find evidence for consistent geometric patterns that singularities may form across the cortex. Except for a different overall alignment of orientation bands and different periods of repetition, maps obtained from different animals and at different ages are found to be similar with respect to the measures used. Orientation maps are then compared with two different pattern models that are currently discussed in the literature: bandpass-filtered white noise, which accounts very well for the overall map structure, and the field analogy model, which specifies the orientation map by the location of singularities and their properties. The bandpass-filtered noise approach to orientation patterns correctly predicts the sign correlations between singularities and accounts for the deviations of the spatial distribution of singularities away from a random dot pattern. The field analogy model can account for the structure of certain local patches of the orientation map but not for the whole map. Neither of the models is completely satisfactory, and the structure of the orientation map remains to be fully explained.

1 Introduction

Based on Hubel and Wiesel's original findings (Hubel & Wiesel 1974), many experiments were performed to determine the two-dimensional lateral distribution of orientation selectivity in the upper layers of the primate striate cortex (Bartfeld & Grinvald 1992; Blasdel 1992a, 1992b; Blasdel & Salama 1986; Blasdel et al. 1995; Obermayer & Blasdel 1993). The results gave rise to several pattern models (Cowan & Friedman 1991; Niebur & Wörgötter 1994; Rojer & Schwartz 1990; Schwartz 1994; Swindale 1982; for
further references see Erwin et al. 1995) designed for their interpretation. Before imaging methods were available, pattern models were used to infer a two-dimensional spatial pattern from one-dimensional or sparse two-dimensional data recorded with microelectrodes. Today two-dimensional patterns can be measured directly, but pattern models are still valuable tools. Pattern models parameterize the observed data and are important for quantitative comparisons of all sorts, for example, between patterns observed in different species, between patterns observed during development, and between data and models. In this article we report an analysis of singularities in primate orientation maps (Blasdel 1992a, 1992b; Obermayer & Blasdel 1993; see Bonhoeffer & Grinvald 1991 for similar structures in the cat cortex) that were measured by optical methods. Data were obtained from squirrel monkeys and macaque monkeys of different ages. We chose these two species because macaque monkeys have ocular dominance columns while squirrel monkeys do not, and we chose macaques of different ages in order to search for changes during development. Our experimental results are then compared with pattern models: (1) bandpass-filtered white noise (Cowan & Friedman 1991; Erwin et al. 1995; Freund 1994; Niebur & Wörgötter 1994; Rojer & Schwartz 1990; Schwartz 1994; Shvartsman & Freund 1994; Swindale 1982), which assumes that orientation maps can be fully described by their two-point correlation function, and (2) the field analogy model (FAM) (Wolf et al. 1994a, 1994b), which claims that orientation maps are fully determined by their singularities and that the orientation map in between is optimally smooth.

The article is divided into five sections. In section 2 we describe the data, their mathematical representation, and the methods for data analysis. In the third section we introduce both pattern models. The fourth section contains the results of our data analysis and the comparison between data and models, and the fifth section concludes with a brief discussion.
2 Data: Singularities and Methods of Analysis

2.1 Data. We analyzed 12 orientation maps obtained with optical imaging using standard methods (Blasdel 1992a, 1992b). The data (see Table 1) include six maps from adult macaques (varieties nemestrina and fascicularis), five maps from baby macaques of different ages, and one map from an adult squirrel monkey. Pictures of the spatial patterns of orientation selectivity and details of the experimental procedures can be found in the publications cited in Table 1. For the purpose of this article, each "raw" data set corresponds to a two-dimensional pixel image with 2 bytes per pixel, where the first byte codes for the orientation preference and the second byte for the orientation tuning strength. Raw data were converted to a complex orientation map, which was used for subsequent analyses. Every pixel corresponds to 8.6 µm for high-power and 15.6 µm for low-power images, respectively.
Table 1: Orientation Maps Used in This Study

Species               Name     Age     Size    Density of singularities (mm^-2)   λ_OR
                               (wks)   (mm²)   pos. sign        neg. sign         (µm)
Macaca nemestrina     NM1 a    adult   14.5    3.9              3.8               768
                      NM2 a    adult   14.5    4.7              4.7               645
                      NM3 a    adult   48.0    3.9              3.8               621
                      NM4 a    adult   48.0    3.6              3.2               719
                      BN1 b    14.0    14.5    3.9              3.8               700
                      BN2 b    9.0     48.0    3.7              3.7               658
                      BN3 b    7.5     14.5    4.5              4.5               615
                      BN4 b    5.5     48.0    3.7              3.7               714
Macaca fascicularis   FS1 a,d  adult   48.0    4.4              4.4               676
                      FS2 a,d  adult   48.0    4.2              4.0               647
                      FS3 b    3.5     14.5    3.9              3.9               660
Saimiri sciureus      SM1 c    adult   10.3    5.3              5.8               512

Note: The average wavelength λ_OR of the orientation map was averaged over all directions in the cortex, neglecting the anisotropy of the orientation pattern. The density of singularities does not necessarily correlate with the average spatial frequency, because the density also depends on the coherence properties (see Obermayer et al. 1992) of the maps.
a Data from Obermayer & Blasdel (1993).
b Data from Blasdel et al. (1995).
c Unpublished data.
d Data from two adjacent regions of the same animal.
2.2 Mathematical Description of Orientation Maps. The spatial distribution of orientation selectivity can be mathematically described by a complex orientation field O(r) (Swindale 1982),
$$O(\vec r) = q(\vec r)\,\exp\bigl(2 i\,\phi(\vec r)\bigr), \tag{2.1}$$
where i is the imaginary unit. The phase φ(r) denotes the preferred orientation at the cortical location r = (x, y)^T, and the amplitude q(r) is a measure of orientation tuning. The factor of two must be introduced because orientation preferences are π-periodic while the phase φ(r) has period 2π. If we adopt this mathematical description, singularities in orientation maps correspond to the zeroes of the function O(r). Close to a singularity, the orientation field O(r) can be linearized. The singularity can then be described by the location r_j = (x_j, y_j) of the zero crossing and by the four components of the normal vectors of the real and the imaginary tangential planes. For an easier interpretation of the pattern of iso-orientation lines close to a singularity, it is convenient to combine the four components of the normal vectors into four other quantities: amplitude a_j ∈ R_0^+, anisotropy α_j ∈ R, angle ρ_j ∈ [0, π], and skewness σ_j ∈ [0, π] (Freund 1994). The orientation field O_j around a singularity j is
then given by
$$O_j = X_j + i\,\alpha_j Y_j, \tag{2.2}$$
where
$$\begin{aligned} X_j &= a_j\bigl\{\ \ (x - x_j)\cos(\rho_j) + (y - y_j)\sin(\rho_j)\bigr\},\\ Y_j &= a_j\bigl\{-(x - x_j)\sin(\rho_j + \sigma_j) + (y - y_j)\cos(\rho_j + \sigma_j)\bigr\} \end{aligned} \tag{2.3}$$
denote the cartesian components of the orientation vector. The sign v_j of a singularity j is given by
$$v_j = \operatorname{sign}(\alpha_j). \tag{2.4}$$
It is positive if orientation preferences increase in a clockwise direction around the singularity and negative otherwise.

2.3 Measurement of Locations, Signs, and Angles. In order to determine the positions of singularities, changes of orientation preference are added clockwise along small circles of radius 50 µm around every point in the image. If the sum Γ along a closed loop is close to ±180 degrees, then a singularity is enclosed by the circle, and the pixel is marked by +1 or −1, where ±1 is the sign v_j of the singularity. Otherwise the pixel is marked by 0.¹ Note that Γ assumes multiples of 180 degrees within the numerical precision of the calculation. Disks of radius 50 µm are then fitted to the regions marked +1 and −1, and the location of the center of the disk having maximum overlap with the corresponding region is taken as the spatial location of the singularity. For the calculation of orientation angles, an arbitrary direction is chosen as the cortical x-axis, and the orientation value of the pixel located at a distance of 5 pixels along the x-axis from the center of the singularity is determined.

¹ If Γ is close to ±360 degrees, then a singularity is enclosed around which orientation preferences cycle twice. Singularities with Γ close to ±360 degrees, however, never occur in the experimental data.
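To make the winding-number test above concrete, here is a minimal sketch (our own illustration, not the authors' code; the function names, the tolerance, and the 6-pixel loop radius, which corresponds to roughly 50 µm at 8.6 µm per pixel, are our assumptions). The disk-fitting step that refines the final locations is omitted.

```python
import numpy as np

def wrap(dphi):
    # map an orientation difference into [-90, 90) degrees
    # (orientation preference is 180-degree periodic)
    return (dphi + 90.0) % 180.0 - 90.0

def mark_singularities(phi, radius=6, n_samples=32, tol=10.0):
    """phi: 2-D array of orientation preferences in degrees.
    Returns an int array with +1/-1 where the clockwise loop sum Gamma
    is close to +180/-180 degrees, and 0 elsewhere."""
    h, w = phi.shape
    marks = np.zeros((h, w), dtype=int)
    # clockwise sample points on a circle of the given pixel radius
    t = -np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    dy = np.round(radius * np.sin(t)).astype(int)
    dx = np.round(radius * np.cos(t)).astype(int)
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            loop = phi[y + dy, x + dx]
            # Gamma: sum of wrapped orientation changes around the loop
            gamma = wrap(np.diff(np.r_[loop, loop[0]])).sum()
            if abs(abs(gamma) - 180.0) < tol:
                marks[y, x] = 1 if gamma > 0 else -1
    return marks
```

Disks of radius 50 µm would then be fitted to the connected ±1 regions to obtain the final singularity locations, as described above.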
2.4 Analysis of Local Geometric Patterns of Singularities. One cannot expect to find singularities arranged on a regular grid in a crystal-like fashion. Even if local geometric patterns exist, they are quite likely subject to changes in position, orientation, and size due to inherent biological variability or to artifacts of the measurement procedures. Methods to detect local features should be robust against these changes. In the following, we consider a method that is invariant under translation, rotation, and scaling.

The local pattern S_j(r) around a singularity j can be described by a sum of δ-functions,
$$S_j(\vec r) = \sum_{l \in D_j} \delta(\vec r - \vec r_l), \tag{2.5}$$
where D_j denotes a circular region around singularity j. The radius d of D_j was chosen to be 275 µm (32 pixels) for high-power images and 500 µm (32 pixels) for low-power images, respectively. In order to achieve a representation of the local pattern (equation 2.5) that is invariant under translation, rotation, and scaling, we apply the following transformations:

1. Fourier transformation and neglect of phase:
$$\hat S_j(\vec k) = \bigl|\mathcal{F}\bigl(S_j(\vec r)\bigr)\bigr|. \tag{2.6}$$
Ŝ_j(k) is now invariant under translation.

2. Transformation to polar coordinates (κ, φ), with k_y = κ cos φ and k_x = κ sin φ:
$$\tilde S_j(\kappa, \varphi) = \hat S_j\bigl(\{\kappa\cos\varphi,\, \kappa\sin\varphi\}^T\bigr). \tag{2.7}$$

3. Fourier transformation and neglect of phase:
$$T_j(\vec t\,) = \bigl|\mathcal{F}\bigl(\tilde S_j(\kappa, \varphi)\bigr)\bigr|. \tag{2.8}$$
T_j(t) is now invariant under rotation and scaling as well.

Patterns are averaged over all singularities that are located more than a distance d from the border of the orientation map.
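The pipeline of equations 2.6 through 2.8 can be sketched as follows (our own illustration: the nearest-neighbor interpolation and the logarithmically spaced κ samples, which turn scaling into a shift along the κ axis, are our simplifications).

```python
import numpy as np

def invariant_spectrum(pattern, n_kappa=32, n_phi=64):
    """pattern: 2-D array with ones at singularity locations (a pixel
    version of the delta sum of equation 2.5). Returns T, invariant
    under translation, rotation, and scaling of the pattern."""
    # step 1 (equation 2.6): amplitude spectrum -> translation invariance
    amp = np.abs(np.fft.fftshift(np.fft.fft2(pattern)))
    cy, cx = amp.shape[0] // 2, amp.shape[1] // 2
    # step 2 (equation 2.7): resample onto polar coordinates; rotation
    # becomes a shift along phi, scaling a shift along log-spaced kappa
    kappa = np.exp(np.linspace(0.0, np.log(min(cy, cx) - 1), n_kappa))
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    ky = np.round(cy + np.outer(kappa, np.sin(phi))).astype(int)
    kx = np.round(cx + np.outer(kappa, np.cos(phi))).astype(int)
    polar = amp[ky, kx]
    # step 3 (equation 2.8): second amplitude spectrum -> the remaining
    # shifts affect only the phase, so T is rotation and scale invariant
    return np.abs(np.fft.fft2(polar))
```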
3 Models

3.1 The Field Analogy Model. Historically, the field analogy model was motivated by the similarity between contour plots of orientation maps and the lines of force between electric point charges in the plane (Wolf et al. 1994a). This similarity rests on four assumptions:

1. Iso-orientation lines do not form spirals.
2. Iso-orientation lines begin and end at singularities.
3. The complete sequence of orientations is represented only once around a singularity.
4. Singularities are isotropic and do not show skewness (σ_j = α_j = 0).
Let us describe the electric field, which is generated by a distribution of point charges in the plane, via its complex potential U,
$$U = \Phi + i\,\Psi. \tag{3.1}$$
The lines of equal Φ correspond to the iso-potential lines of the charge distribution. The lines of equal Ψ correspond to the lines of force. For a set of point charges j with charge v_j at locations (x_j, y_j), the potential U is given by
$$U = -\sum_j v_j \operatorname{Ln}(z - z_j), \tag{3.2}$$
where z = x + iy, z_j = x_j + i y_j, and Ln(z) = ln|z| + i arg(z). The lines of force are then given by the lines of equal Ψ,
$$\Psi = \arg \prod_j (z - z_j)^{-v_j}. \tag{3.3}$$
For unit charges v_j = ±1 we obtain
$$\Psi(x, y) = \arg \prod_j \bigl\{(x - x_j) - i\,v_j\,(y - y_j)\bigr\} = \arg \prod_j O_j. \tag{3.4}$$
Equation 3.4 defines the FAM. It assumes that the orientation preference map is fully determined by the locations and signs of its singularities. Equation 3.4 can also be derived from a biologically more relevant smoothness criterion (Wolf et al. 1994b). It has been proposed that orientation maps are characterized by the conflicting aims of continuity and diversity (Erwin et al. 1995) and that orientation preferences are mapped across the cortex as smoothly as possible under certain constraints (Obermayer et al. 1990). Let us therefore define an appropriate cost function U,
$$U = \frac{1}{2} \int_V d\vec r\; \bigl|\nabla \Psi(\vec r)\bigr|^2, \tag{3.5}$$
via the magnitude of the gradient of orientation preferences. Because U is proportional to the electrostatic energy for a given distribution of point charges, it is minimized by the orientation map of equation 3.4. The FAM thus claims that orientation preference maps are optimally smooth in the sense of equation 3.5 under the constraint that their singularities are isotropic and given by {(x_j, y_j, v_j)}. In order to test the FAM approach, locations and signs of singularities are determined from the experimental data, and the orientation map is
reconstructed using equation 3.4. The error R between the original map Θ(x, y) and its reconstruction Ψ(x, y) is calculated by
$$R^2 = \frac{1}{N} \sum_{x,y} \Bigl\{ g\bigl(\Psi(x,y) - \Theta(x,y) + \Theta_{\mathrm{ref}}\bigr) \Bigr\}^2, \tag{3.6}$$
where (x, y) are pixel locations, N is the number of pixels in the sum, Ψ is given by equation 3.4, Θ denotes the original orientation preference, and Θ_ref is chosen such that R is minimal. The function g is given by
$$g(\varphi) = \begin{cases} \varphi + \pi & \text{if } \varphi < -\pi/2,\\ \varphi & \text{if } -\pi/2 \le \varphi < \pi/2,\\ \varphi - \pi & \text{if } \varphi \ge \pi/2. \end{cases} \tag{3.7}$$
Due to border effects, the sum in equation 3.6 runs only over pixels that are at least λ_OR away from the border.
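A minimal sketch (our own illustration, not the authors' code; the function names are ours) of the reconstruction of equation 3.4 and the error measure of equations 3.6 and 3.7. Singularity locations (x_j, y_j) and signs v_j are assumed to have been extracted beforehand, for example with the winding-number test of section 2.3.

```python
import numpy as np

def fam_reconstruct(shape, xs, ys, vs):
    """Psi(x, y) = arg prod_j {(x - x_j) - i v_j (y - y_j)}, equation 3.4,
    accumulated as a sum of angles to avoid overflow of the product."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    psi = np.zeros(shape)
    for xj, yj, vj in zip(xs, ys, vs):
        psi += np.angle((x - xj) - 1j * vj * (y - yj))
    return np.angle(np.exp(1j * psi))  # wrap back into (-pi, pi]

def g(phi):
    """Wrap an angle difference into [-pi/2, pi/2), cf. equation 3.7."""
    return (phi + np.pi / 2.0) % np.pi - np.pi / 2.0

def error_R(psi, theta, n_ref=181):
    """R of equation 3.6, minimized over the global offset Theta_ref;
    chance level is pi/sqrt(12) = 0.907."""
    refs = np.linspace(-np.pi / 2.0, np.pi / 2.0, n_ref)
    return min(np.sqrt(np.mean(g(psi - theta + r) ** 2)) for r in refs)
```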
3.2 Bandpass-Filtered Noise. Bandpass-filtered noise (BFN) models are a class of pattern models that assume that orientation maps can be fully described by their two-point correlation functions. BFN was first suggested by Swindale (1982) and was discussed in more detail later (Cowan & Friedman 1991; Erwin et al. 1995; Freund 1994; Niebur & Wörgötter 1994; Rojer & Schwartz 1990; Schwartz 1994; Shvartsman & Freund 1994), in particular by Schwartz and colleagues. Let us denote the power spectrum of the orientation map O by Ô, and let Ô be given by (Obermayer & Blasdel 1993; Obermayer et al. 1992)²
$$\hat O(\vec k) \sim \exp\left(-2\,\frac{(|\vec k| - k_0)^2}{\sigma_k^2}\right). \tag{3.8}$$

² In most species we find slightly elliptical power spectra (Obermayer & Blasdel 1993). For simplicity we consider isotropic spectra as given by equation 3.8, but the formalism can easily be extended to elliptic spectra (Schwartz 1994).
BFN claims that the essential properties of the orientation maps are captured by patterns
$$O(\vec r) = \mathcal{F}^{-1}\left(\sqrt{\hat O(\vec k)} \cdot \exp\bigl(i\,\chi(\vec k)\bigr)\right), \tag{3.9}$$
where the phases χ(k) are random variables drawn from the interval [0, 2π] with equal probability. The only model parameters that have to be adjusted (in the isotropic case) are the wave number k₀ and the width σ_k.

Models similar to equation 3.9 have also been studied in the physics literature as models of the optical fields that are generated when coherent light is scattered by a random medium (Freund 1994; Freund & Shvartsman 1994; Shvartsman & Freund 1994). Although BFN is primarily thought of as a convenient parametrization of orientation maps, it can be linked to processes underlying development. Swindale (1982) was among the first to point out the relation of equation 3.8 to Mexican hat-shaped cortical interactions and generated quite realistic orientation maps using BFN-type models. Later Schwartz and colleagues explored those models in detail (Rojer & Schwartz 1990; Schwartz 1994) and demonstrated how the variety of stripe and bloblike patterns, as well as the network of singularities, can emerge under simple and plausible assumptions. Linsker (1986), Miller et al. (1989), Miller (1994), and Obermayer et al. (1992) investigated models based on Hebbian learning. They showed that BFN is a robust phenomenon in models of Hebbian learning and, given that those models capture reality, also a robust phenomenon in neural development. BFN emerges from the "linear part" of the developmental dynamics, which predicts the growth of a set of independent Fourier-like eigenmodes. The shape of the BFN filter is then related to the spectrum of eigenvalues, which in turn is related to the shape of the lateral connectivity within the cortex. Consequently, several features of BFN maps are similarly predicted by models based on Hebbian learning. Note that BFN maps are not in general "optimally smooth" in the sense of equation 3.5; hence, optimal smoothness is not expected for Hebbian models either.

For our numerical simulations we choose k₀ = 1 in equation 3.8 without loss of generality and obtain σ_k ∈ [0.2, 0.26] from our experimental data (Obermayer & Blasdel 1993). Numerical simulations were performed with σ_k = 0.25 on a discrete lattice of units, where k₀ = 1 corresponds to a period of the orientation pattern of 12.8 pixels.
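The construction is straightforward to reproduce. The following sketch (our own illustration) draws random phases, imposes the annular power spectrum of equation 3.8 with the parameter values just quoted, and inverse-transforms according to equation 3.9.

```python
import numpy as np

def bfn_map(size=256, period=12.8, rel_width=0.25, seed=0):
    """Generate a complex BFN orientation field O (equation 3.9)."""
    rng = np.random.default_rng(seed)
    k0 = 1.0 / period            # k0 = 1 in the text <-> period of 12.8 pixels
    sigma_k = rel_width * k0     # sigma_k = 0.25 in units of k0
    ky = np.fft.fftfreq(size)[:, None]
    kx = np.fft.fftfreq(size)[None, :]
    k = np.hypot(kx, ky)
    power = np.exp(-2.0 * (k - k0) ** 2 / sigma_k ** 2)   # equation 3.8
    chi = rng.uniform(0.0, 2.0 * np.pi, (size, size))     # random phases
    O = np.fft.ifft2(np.sqrt(power) * np.exp(1j * chi))   # equation 3.9
    phi = 0.5 * np.angle(O)   # preferred orientation, since O = q exp(2i phi)
    q = np.abs(O)             # orientation tuning strength
    return O, phi, q
```

Singularities of the resulting map are the zeroes of O and can be located with the winding-number test of section 2.3.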
4 Results

4.1 Phase Portraits of Primate Singularities. Figure 1 shows four examples of singularities found in our data from the primate cortex. Although many singularities are isotropic, as shown in Figure 1a, visual inspection reveals singularities with a considerable degree of anisotropy or skewness (see Figures 1b and 1c). This is particularly true for the squirrel monkey map, where iso-orientation bands align and give rise to nonisotropic singularities.

4.2 Comparison of FAM Reconstructions with Data. Figure 2 shows the iso-orientation lines of the macaque map NM1 and the squirrel monkey map SM1 in comparison with the iso-orientation lines of the corresponding FAM reconstructions (right). The reconstruction of the macaque map shows a pattern of iso-orientation contours that is in qualitative agreement with the experimental data in the sense that iso-orientation lines begin and end at
Figure 1: Contour plot representation of the orientation map around selected singularities, 400 µm × 400 µm in size, taken from NM1. Black lines indicate iso-orientation contours in intervals of 12.5 degrees. Note that not all singularities are pinwheel-like, as is the singularity shown at the top left (a). Iso-orientation contours are often compressed along one axis and expanded along the other, as shown in (b) and (c). This effect is a result of finite anisotropy or skewness. Panel (d) shows a singularity with two axes of compression.
corresponding singularities in both maps. Details of the contour map are different (arrows), though, and lead to errors in the prediction of actual orientation preferences assigned to map locations. The reconstruction of the squirrel monkey map exhibits a contour pattern different from its original in the sense that iso-orientation lines do not begin and end at corresponding singularities (arrows). This is not surprising, because the squirrel monkey map deviates most from the assumption of isotropic singularities. Table 2 summarizes the average reconstruction errors R (cf. equation 3.6). Reconstruction errors are fairly small for small areas but approach chance
Figure 2: Contour plot representations of orientation maps for NM1 (3.3 mm × 4.4 mm) and SM1 (2.9 mm × 3.6 mm) in comparison with the corresponding FAM reconstructions. Iso-orientation contours denote intervals of 12.5 degrees for the NM1 map and 15.0 degrees for the SM1 map. The arrows indicate corresponding regions in the reconstructions (right) and their originals (left).
level for larger areas.³ This shows that the FAM captures some of the local structure of the orientation maps. In regions where anisotropic singularities are abundant and fractures occur, and in maps that, like SM1, are strongly anisotropic, the FAM approach breaks down and errors approach chance level. The FAM reconstructions of patterns generated by BFN lead to large errors (see Table 2). The main reason is the large number of fractures in the BFN maps, which violate the FAM assumption of optimal smoothness: iso-orientation lines in BFN maps seem to "attract" each other and form narrow regions of rapid change in orientation preference (see Erwin et al. 1995, Figure 6b), while the FAM predicts that iso-orientation lines repel each other, as lines of force do.

³ Errors are always below chance level because of the minimization procedure described in section 3.1.
Table 2: Average Reconstruction Error R (equation 3.6) per Pixel for the FAM Maps, for an Area of Reconstruction of 4.7 mm² (left column) and 24.1 mm² (right column) Located in the Center of the Data Sets

Name   R, 4.7 mm² (degrees)   R, 24.1 mm² (degrees)
NM1    0.69                   n.a.
NM2    0.53                   n.a.
NM3    0.60                   0.79
NM4    0.78                   0.84
BN1    0.68                   n.a.
BN2    0.58                   0.76
BN3    0.68                   n.a.
BN4    0.50                   0.73
FS1    0.81                   0.85
FS2    0.68                   0.73
FS3    0.64                   n.a.
SM1    0.84 (for 2.3 mm²)     n.a.
BFN    0.88                   n.a.

Note: The value for BFN was obtained from a data set of 460 × 460 pixels containing approximately 1000 singularities. Chance level is R = π/√12 = 0.907.
The behavior of iso-orientation lines in experimental maps lies somewhere between these extremes, and neither FAM nor BFN fully captures the local map structure.

4.3 Distribution of Singularities, Sign Correlations, and Orientation Angles. Figure 3 shows the locations of singularities for NM2 and SM1 in comparison with data generated by BFN. Visual inspection does not reveal any consistent local pattern of singularities but shows that nearest neighbors usually have the opposite sign. This remains true for the anisotropic map SM1, though the pattern is compressed along the left oblique axis and expanded along the right oblique axis. Table 3 shows the sign correlations between nearest to fifth-nearest neighbors for all data sets in comparison with BFN. Approximately 80% of nearest neighbors and 60% to 70% of second-nearest neighbors are of the opposite sign. For singularities farther apart, same and different signs occur approximately equally often. Note that there are no significant changes with age.
Table 3: Sign Correlations of Close-by Singularities

Name   Pairs          First (%)      Second (%)     Third (%)      Fourth (%)     Fifth (%)
NM1      46   same    4   9  13     26  17  43     15  33  48     24  26  50     24  22  46
              diff   43  43  87     22  35  57     33  20  52     24  26  50     24  30  54
NM2      51   same    4  10  14     24  18  41     27  31  59     29  22  51     22  18  39
              diff   45  41  86     25  33  59     22  20  41     20  29  49     27  33  61
NM3     199   same   12   9  21     21  17  38     24  19  43     26  25  50     24  25  49
              diff   40  39  79     32  31  62     28  29  57     27  23  50     28  23  51
NM4     252   same    8   8  15     19  19  38     27  24  51     27  27  53     26  21  46
              diff   44  41  85     32  30  62     25  25  49     25  22  47     25  28  54
BN1      45   same    7   2   9     27  13  40     27  18  44     27  27  53     27  29  56
              diff   49  42  91     29  31  60     29  27  56     29  18  47     29  16  44
BN2     222   same    6   6  13     17  27  44     26  22  48     26  22  48     24  26  50
              diff   43  45  87     32  24  56     23  29  52     23  29  52     25  25  50
BN3      54   same   11  15  26     24  22  46     26  11  37     26  24  50     28  20  48
              diff   41  33  74     28  26  54     26  37  63     26  24  50     24  28  52
BN4     238   same   11   7  18     17  15  33     28  22  51     27  27  53     26  24  50
              diff   40  42  82     34  33  67     23  27  49     25  22  47     25  25  50
FS1     242   same    9   8  17     17  19  36     24  29  53     28  24  52     31  22  53
              diff   41  42  83     33  31  64     26  21  47     22  26  48     19  28  48
FS2     285   same    8   8  16     19  18  37     25  26  51     23  19  42     24  29  53
              diff   43  41  84     32  31  63     26  23  49     28  30  58     27  20  47
FS3      48   same   10   6  17     29  10  40     19  15  33     33  23  56     27  27  54
              diff   46  38  83     27  33  60     38  29  67     23  21  44     29  17  46
SM1      57   same   14   9  23     14  21  35     21  26  47     26  23  49     16  21  37
              diff   35  42  77     35  30  65     28  25  53     23  28  51     33  30  63
BFN    3907   same    9  10  19     18  16  34     21  21  42     25  24  49     27  27  54
              diff   41  40  81     33  33  66     29  29  58     25  25  51     23  23  46

Note: For nearest to fifth-nearest neighbors, rows marked "same" give the percentages of ++ and −− pairs and their sum Σ; rows marked "diff" give the percentages of +− and −+ pairs and their sum Σ. Percentages do not always add up exactly because of rounding errors; singularities close to the image borders have not been used.
Figure 3: Locations of singularities for the data sets NM2, SM1, and BFN. The locations of positive and negative singularities are marked by + and −. The arrows indicate the direction toward area 18. Scale bars are 1 mm.
The strong correlations in sign are consistent with the predictions of BFN. For fifth-nearest neighbors, BFN predicts a slight overrepresentation of pairs of the same sign, but the data are too noisy to confirm this overrepresentation.

In order to characterize the distribution of orientation angles, angles were pooled into six bins of 30 degrees and compared by a χ² test to the hypothesis that all orientation angles occur equally often. All sets of data except BN2 are consistent with this hypothesis for α = 0.05.⁴ There are no significant differences between macaque and squirrel monkeys or between animals of different ages. The predictions of BFN also pass the χ² test and are in agreement with the experimental data.

⁴ BN2 is consistent with this hypothesis at a level of significance α = 0.01.

4.4 Distribution of Distances Between Nearest Neighbors. Figure 4 shows histograms of the distances between singularities that are nearest neighbors, independent of sign.
Figure 4: Histograms of distances between singularities that are nearest neighbors, independent of sign. Distances were normalized to the square root of the inverse density of singularities and divided into bins of size one-eighth. The percentage of distances that fall into the corresponding bins is plotted against the value corresponding to the center of the bin. Data points are connected by lines. We used 242 pairs for FS1, 238 pairs for BN4, and 3907 pairs for BFN. The percentage of pairs for RAN was calculated by integrating equation 4.1 over each bin. BFN parameters were taken from section 3.2.
Let us first compare the distributions of distances with the distribution that arises from independent, randomly distributed points in the plane (RAN). Let ρ be the number density of points, and let Δ = 1/√ρ be the square root of the average area per point. The probability density P(d) of distances d between nearest neighbors is then given by
$$P(d) = \frac{2\pi d}{\Delta^2}\,\exp\left\{-\pi\left(\frac{d}{\Delta}\right)^2\right\}. \tag{4.1}$$
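Equation 4.1 is easy to check numerically. The following sketch (our own illustration; the point count and the periodic boundaries, used to avoid edge bias, are our choices) samples random points and compares the mean nearest-neighbor distance with the analytic value, which in units of Δ is 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, box = 1500, 1.0
pts = rng.uniform(0.0, box, (n, 2))
delta = 1.0 / np.sqrt(n / box ** 2)      # Delta = 1/sqrt(rho)

# pairwise distances with periodic boundaries (avoids edge bias)
d = np.abs(pts[:, None, :] - pts[None, :, :])
d = np.minimum(d, box - d)
dist = np.hypot(d[..., 0], d[..., 1])
np.fill_diagonal(dist, np.inf)

nn = dist.min(axis=1) / delta            # NN distances in units of Delta
print(nn.mean())   # approximately 0.5, the mean of equation 4.1
```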
Figure 4 shows that the peaks of the "experimental" distributions of FS1 and BN4 are shifted to the right of the RAN distribution by approximately 0.15Δ. Similar histograms have been constructed for all data sets and have been compared to the hypothesis RAN by a χ² test. The hypothesis RAN has to be rejected for all sets of data (α = 0.05), and the experimental distributions were shifted toward larger distances in all cases. There were no obvious trends between different species or between animals of different ages. A similar result was obtained for the distribution of distances between nearest neighbors of the same sign (data not shown): the hypothesis RAN has to be rejected for the same reason, except for NM1 and BN3. Let us next compare the distribution of distances with the predictions of BFN. BFN predicts a distribution shifted by approximately 0.15Δ to the right of the experimental distribution; that is, singularities are farther apart
on average. A χ² test rejects the hypothesis BFN for all data sets (α = 0.05) except BN1. For the case of singularities of the same sign (data not shown), BFN predicts a distribution that is closer to the experimental findings in the sense that the peak location is shifted only slightly. A χ² test still rejects the hypothesis BFN for most data sets (α = 0.05), except for NM2, BN1, BN3, and SM1. Again, there were no obvious trends between different species or between animals of different ages. In summary, we are left with the puzzling fact that while singularities of opposite sign attract each other, singularities on the whole weakly repel each other compared to a random distribution. The fact that the distribution of distances does not quite fit the BFN predictions should not be taken too seriously, however, because methodological artifacts tend to shift the distribution toward RAN (see section 5.1).

4.5 Local Patterns of Singularities. Figures 5 and 6 show an evaluation of local patterns of singularities using the procedure described in section 2.4. The average spectra T(t) are shown as a function of t = (t_s, t_r), where the rotation axis t_r is plotted to the right and the scale axis t_s is plotted to the top. The oscillation with period 2 along the rotation axis reflects the fact that the patterns S_j are real and their Fourier transforms Ŝ_j are conjugate complex. Any consistent local arrangement of singularities should give rise to additional structure in the spectra. The spectra of the data sets FS1, BN4, SM1, and BFN are compared with structures that arise when singularities are arranged locally in patterns with four- and sixfold symmetry (bottom row of Figure 5). Visual inspection shows additional structure in the case of the artificial patterns with six- and fourfold symmetry, which is lacking in the experimental data and in the predictions of BFN. In order to demonstrate the effect of rotational symmetry more clearly, the values of T(t) are summed along the scaling axis t_s in Figure 6. Only every second value is shown in order to suppress the oscillation of period 2. Although there are strong oscillations in the case of the "square" and "triangle" patterns, no evidence for symmetries is found for BN4 or for the BFN model. Similar results were obtained for all data sets of Table 1. A similar analysis along the scaling axis did not provide evidence for symmetries either. Thus, no evidence for consistent local patterns of singularities was found in the data, which is consistent with the visual impression one gains from Figure 3. The same analysis for local patterns of same-sign singularities led to the same result (data not shown).

5 Discussion

5.1 Measurement Errors and Possible Artifacts. The reliability of optical imaging data has been addressed in several publications (e.g., Blasdel
Figure 5: Doubly transformed amplitude spectra T(t) of local patterns of singularities. Each pixel corresponds to one mode, and its gray value indicates its amplitude. The origins are in the center of the squares. The patterns are averages of 200 local regions of radius 500 µm (FS1, BN4), 100 local regions of radius 390 µm (SM1), 500 small square and hexagonal gratings in different positions, orientations, and sizes, and 500 local regions of radius 20 pixels for BFN. Each region typically contained 5 to 10 singularities. The spectra are robust with respect to the number of singularities and the size of the local regions chosen for analysis.
1992a, 1992b; Blasdel & Salama 1986; Bonhoeffer & Grinvald 1996), to which we refer the reader for details. Multiple optical recordings from the same area of the cortex, for example, lead to similar spatial maps of neuronal responses (Bonhoeffer & Grinvald 1996). The hypothesis that cells with different orientation preferences are actually arranged randomly across the cortex and that smooth maps are a filtering artifact must be rejected on the basis of electrode penetration studies (Blasdel & Salama 1986) and would also be inconsistent with earlier data (e.g., Hubel & Wiesel 1974). Another piece of evidence against artifacts is the different wavelengths of the orientation maps from different animals, which indicate biological variability rather than methodological artifacts. The sign of a singularity is a nonlocal quantity, which is robust against noise and can be measured with high reliability from the data. This is no longer true for locations, because singularities reside in areas of the map where the optical signal is weak and measurements are affected by noise.
Figure 6: Top row: Doubly transformed amplitude spectra T(t) of local patterns of singularities. The corresponding spectra of Figure 5 were summed along the scaling axis, and the summed amplitudes were plotted against the rotation axis. Only every second value is shown. Strong oscillations are present in the case of four- and sixfold rotational symmetry ("square" and "triangle"). Center: Templates for the square and triangular patterns. The actual patterns were generated by random translation, rotation, and scaling of the templates.
Under the assumption that positional errors are biased for neither sign nor direction in the cortex, errors in location measurements tend to randomize the sign statistics and the distribution of distances. Thus, possible positional errors cannot invalidate correlations that deviate from a random distribution. The spatial averaging inherent in optical recordings, however, may in principle introduce artifacts, because it has the effect that singularities of opposite sign attract each other while singularities of the same sign repel each other. In order to explain the correlations of Table 3, however, averaging artifacts would have to be of a size that would lead to completely deformed patterns. This would be in contradiction to the good match between optical recordings and microelectrode data. Hence, we conclude that the measured correlations in sign and location are significant. Positional errors may explain why the distribution of distances between nearest neighbors deviates from the BFN predictions. If gaussian positional noise with a standard deviation of only 0.25Δ (which corresponds to 100 µm
in real data) is added to the locations of the singularities of the BFN map, the peaks of the nearest- and next-nearest-neighbor distributions shift to lower values and coincide with the experimental data. Consequently, one should not regard the differences between BFN and data as significant, given the precision of the location measurement.

Errors in location measurements tend to randomize local patterns but cannot fully explain the failure to find local geometric patterns of singularities. As long as positional errors are well below the average distance between singularities, diagrams similar to Figure 6 would still show oscillations, albeit with smaller amplitude. For larger positional errors, the signature of local patterns would be lost, but errors of this size would again be in contradiction with the good match between optical recording and electrode penetration studies.

5.2 Comparison Between Models and Data, and Biological Significance. The analysis of imaging data has shown that the spatial structure of orientation preference maps lies somewhere between the predictions of the BFN model and those of the FAM. The FAM predicts a pattern that is too smooth and is based on the incorrect assumption that all singularities are isotropic. Consequently, it is a good model only in regions where neither fractures nor anisotropic singularities occur; those regions are typically below 1 mm in size. Long-range order, which has been found in maps from cat areas 17 and 18 (Wolf et al. 1994b), could not be detected in our primate data. BFN maps do not obey the optimal smoothness criterion, and FAM reconstructions of BFN maps fail. From the fact that reconstruction errors are larger for BFN than for experimental maps, it can be concluded that BFN models fail to describe the orientation preference map fully at the local level. BFN models predict a higher number of fractures than are found in the data. BFN maps, however, are consistent with the observed pattern of singularities.

How do sign correlations emerge? Singularities are located at the intersections of the zero crossings of the real and imaginary parts of the orientation map O. Singularities that are neighbors along a particular zero crossing of either the real or the imaginary part of O must have opposite signs. The sign correlations thus emerge because nearest neighbors usually lie on the same zero crossing. As a result of the sign statistics, BFN models also exhibit long-range correlations: given the zeros of the imaginary and the real parts of O, the sign of any single singularity automatically determines the signs of all other singularities (Freund & Shvartsman 1994).

At the moment we can only speculate about the biological significance of the sign statistics and of the distribution of distances between nearest neighbors. There are at least two explanations. Sign correlations and the deviations from a random distribution are predicted by BFN and are therefore also a feature of simple Hebbian-type learning. But another explanation is also possible. Orientation maps in macaques show a fairly stable wavelength
λ_OR at a time when the visual cortex is still growing (Blasdel et al. 1995). A natural explanation of this phenomenon could be that during growth, singularities are born in opposite-sign pairs that subsequently drift away from each other. The fact that singularities are born in pairs could explain the sign correlations, and the fact that singularities drift apart could explain the apparent repulsion.

Note that no trends have been observed between animals of different ages, which is in accordance with previous findings (Blasdel et al. 1995; Wiesel & Hubel 1974). More interesting, no trends have been observed between macaques, which show ocular dominance bands, and squirrel monkeys, which do not. This is surprising for two reasons. First, the two-dimensional power spectra of orientation maps from squirrel monkeys are strongly anisotropic, while power spectra of macaque data are not. Second, singularities in macaques show correlations with ocular dominance bands (Blasdel 1992b; Obermayer & Blasdel 1993), and one would have expected differences in their distribution across the cortex when compared with maps from squirrel monkeys, in which this constraint is not present. That no differences are present could result from the fact that correlations between singularities and ocular dominance bands are too weak to be detected by the measures used in this study.

Acknowledgments

We are grateful to two anonymous reviewers for their helpful comments. This research was funded in part by the Human Frontier Science Program Organization (grant RG94-98) and the Deutsche Forschungsgemeinschaft (grant OB 102/2-1). We thank V. Schwanenhorst for implementing some of the C programs for local pattern analysis.

References

Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation-preference pinwheels, cytochrome oxidase blobs, and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89, 11905–11909.
Blasdel, G. G. (1992a). Differential imaging of ocular dominance and orientation selectivity in monkey striate cortex. J. Neurosci., 12, 3117–3138.
Blasdel, G. G. (1992b). Orientation selectivity, preference, and continuity in monkey striate cortex. J. Neurosci., 12, 3139–3161.
Blasdel, G. G., & Salama, G. (1986). Voltage sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585.
Blasdel, G. G., Obermayer, K., & Kiorpes, L. (1995). Organization of ocular dominance and orientation columns in the striate cortex of neonatal macaque monkeys. Vis. Neurosci., 12, 589–603.
Bonhoeffer, T., & Grinvald, A. (1991). Orientation columns in cat are organized in pinwheel-like patterns. Nature, 353, 429–431.
Bonhoeffer, T., & Grinvald, A. (1996). Optical imaging based on intrinsic signals: The methodology. In A. W. Toga & J. C. Mazziotta (Eds.), Brain mapping: The methods (pp. 55–97). San Diego: Academic Press.
Cowan, J., & Friedman, A. (1991). Simple spin models for the development of ocular dominance columns and iso-orientation patches. In R. Lippman, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 26–31). San Mateo, CA: Morgan Kaufmann.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comp., 7, 425–468.
Freund, I. (1994). Optical vortices in Gaussian random fields: Statistical probability densities. J. Opt. Soc. Am. A, 11, 1644–1652.
Freund, I., & Shvartsman, N. (1994). Wave-field phase singularities: The sign principle. Phys. Rev. A, 50, 5164–5172.
Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol., 158, 267–294.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation columns. Proc. Natl. Acad. Sci. USA, 83, 8779–8783.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangements of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14, 409–441.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Comp., 6, 602–614.
Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A, 45, 7568–7589.
Obermayer, K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proc. Natl. Acad. Sci. USA, 87, 8345–8349.
Rojer, A. S., & Schwartz, E. L. (1990). Cat and monkey cortical columnar patterns modeled by bandpass-filtered 2D white noise. Biol. Cybern., 62, 381–391.
Schwartz, E. L. (1994). Computational studies of the spatial architecture of primate visual cortex. In A. Peters & K. S. Rockland (Eds.), Cerebral cortex (Vol. 10). New York: Plenum Press.
Shvartsman, N., & Freund, I. (1994). Vortices in random wave fields: Nearest neighbor anticorrelations. Phys. Rev. Lett., 72, 1008–1011.
Swindale, N. V. (1982). A model for the formation of orientation columns. Proc. R. Soc. Lond. B, 215, 211–230.
Wiesel, T. N., & Hubel, D. H. (1974). Ordered arrangement of orientation columns in monkeys lacking visual experience. J. Comp. Neurol., 158, 307–318.
Wolf, F., Pawelzik, K., & Geisel, T. (1994a). Emergence of long-range order in maps of orientation preference. In M. Marinaro & P. G. Morasso (Eds.), Artificial neural networks I (pp. 38–41). New York: Elsevier Science Publishers.
Wolf, F., Pawelzik, K., Geisel, T., Kim, D. S., & Bonhoeffer, T. (1994b). Optimal smoothness of orientation preference maps. In F. H. Eeckman (Ed.), Computation in neurons and neural systems (pp. 97–101). Newton, MA: Kluwer Academic Publishers.
Received July 25, 1995; accepted June 24, 1996.
Communicated by Ken Miller
Topographic Receptive Fields and Patterned Lateral Interaction in a Self-Organizing Model of the Primary Visual Cortex

Joseph Sirosh
HNC Software Inc., 5930 Cornerstone Court West, San Diego, CA 92121 USA

Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712 USA
This article presents a self-organizing neural network model for the simultaneous and cooperative development of topographic receptive fields and lateral interactions in cortical maps. Both afferent and lateral connections adapt by the same Hebbian mechanism in a purely local and unsupervised learning process. Afferent input weights of each neuron self-organize into hill-shaped profiles, receptive fields organize topographically across the network, and unique lateral interaction profiles develop for each neuron. The model demonstrates how patterned lateral connections develop based on correlated activity and explains why lateral connection patterns closely follow receptive field properties such as ocular dominance.

1 Introduction

The response properties of neurons in many sensory cortical areas are ordered topographically; that is, nearby neurons respond to nearby areas of the receptor surface. Such topographic maps form by the self-organization of afferent connections to the cortex, driven by external input (Hubel & Wiesel 1965; Miller et al. 1989; Stryker et al. 1988; von der Malsburg 1973). Several neural network models have demonstrated how global topographic order can emerge from local cooperative and competitive lateral interactions within the cortex (Amari 1980; Kohonen 1982, 1993; Miikkulainen 1991; Willshaw & von der Malsburg 1976). These models are based on predetermined lateral interactions and focus on explaining how the afferent synaptic weights are organized. A number of recent neurobiological experiments indicate that lateral connections self-organize like the afferent connections:

1. The lateral connectivity is not uniform or genetically predetermined
but forms during early development based on external input (Katz & Callaway 1992; Löwel & Singer 1992).

2. In the primary visual cortex, lateral connections are initially widespread and then develop into clustered patches. The clustering period overlaps substantially with the period during which ocular dominance and orientation columns form (Katz & Callaway 1992; Dalva & Katz 1994; Burkhalter et al. 1993).

3. Lateral connections primarily connect areas with similar response properties, such as columns with the same orientation or (in the strabismic case) the same eye preference (Gilbert 1992; Löwel & Singer 1992).

4. The lateral connections are far more numerous than the afferents and are believed to have a substantial influence on cortical activity (Gilbert et al. 1990).

To account fully for cortical self-organization, a cortical map model must demonstrate that both afferent and lateral connections can organize simultaneously, from the same external input, and in a mutually supportive manner. We have previously shown how Kohonen's self-organizing feature maps (Kohonen 1982) can be generalized to include self-organizing lateral connections (the laterally interconnected synergetically self-organizing map, LISSOM; Sirosh & Miikkulainen 1993, 1994). LISSOM is a low-dimensional abstraction of the cortical self-organizing process and models a small region of the cortex where all neurons receive the same input vector. In contrast, this article shows how realistic, high-dimensional receptive fields develop as part of the self-organization and in essence scales up the LISSOM approach to large areas of the cortex, where different parts of the cortical network receive inputs from different parts of the receptor surface. This receptive field LISSOM model shows how (1) topographically ordered receptive fields develop from simple retinal images, (2) lateral connections self-organize cooperatively and simultaneously with the afferents, (3) long-range lateral connections store correlations in activity across the topographic map, and (4) the resulting lateral connection patterns closely follow receptive field properties such as ocular dominance.

2 The Receptive Field LISSOM Model

The cortical network is modeled as a sheet of neurons interconnected by short-range excitatory lateral connections and long-range inhibitory lateral connections (see Figure 1). Neurons receive input from a receptive surface, or "retina," through the afferent connections. These connections come from overlapping patches on the retina called anatomical receptive fields (RFs). The patches are distributed with a given degree of randomness. The N × N network is projected onto the retina of R × R receptors, and each neuron is assigned a receptive field center (c₁, c₂) randomly within a radius ρ · R
Figure 1: Architecture of the self-organizing network. The lateral excitatory and lateral inhibitory connections of a single neuron in the network are shown, together with its afferent connections. The afferents form a local anatomical receptive field on the retina.
(ρ ∈ [0, 1]) of the neuron's projection. Through the afferent connections, the neuron receives input from receptors in a square area around the center with side s. Depending on its location, the number of afferents to a neuron can vary from (1/2)s × (1/2)s (at the corners) to s × s (at the center). The external and lateral weights are organized through an unsupervised learning process. At each training step, the neurons start out with zero activity. The initial response η_ij of neuron (i, j) is based on the scalar product
$$\eta_{ij} = \sigma\left(\sum_{r_1, r_2} \xi_{r_1 r_2}\, \mu_{ij, r_1 r_2}\right), \tag{2.1}$$
where ξ_{r₁,r₂} is the activation of a retinal receptor (r₁, r₂) within the receptive field of the neuron, µ_{ij,r₁r₂} is the corresponding afferent weight, and σ is a piecewise linear approximation of the familiar sigmoid activation function,
$$\sigma(x) = \begin{cases} 0 & x \le \delta,\\ (x - \delta)/(\beta - \delta) & \delta < x < \beta,\\ 1 & x \ge \beta, \end{cases} \tag{2.2}$$
where δ and β are the lower and upper thresholds. The response evolves over time through lateral interactions. At each time step, the neuron combines
retinal activation with lateral excitation and inhibition,
$$\eta_{ij}(t) = \sigma\left(\sum_{r_1, r_2} \xi_{r_1 r_2}\, \mu_{ij, r_1 r_2} + \gamma_e \sum_{k,l} E_{ij,kl}\, \eta_{kl}(t - \delta t) - \gamma_i \sum_{k,l} I_{ij,kl}\, \eta_{kl}(t - \delta t)\right), \tag{2.3}$$
where E_{ij,kl} is the excitatory lateral connection weight on the connection from neuron (k, l) to neuron (i, j), I_{ij,kl} is the inhibitory connection weight, and η_{kl}(t − δt) is the activity of neuron (k, l) during the previous time step. All connection weights are positive. The input activity stays constant while the neural activity settles. The scaling factors γ_e and γ_i determine the strength of the lateral excitatory and inhibitory interactions. The activity pattern starts out diffuse and spread over a substantial part of the map, but within a few iterations of equation 2.3 it converges into a stable, focused patch of activity, or activity bubble. After the activity has settled, the connection weights of each neuron are modified. Both afferent and lateral connection weights adapt according to the same mechanism, the Hebb rule, normalized so that the sum of the weights is constant:
$$w_{ij,mn}(t+1) = \frac{w_{ij,mn}(t) + \alpha\, \eta_{ij} X_{mn}}{\sum_{mn}\bigl[w_{ij,mn}(t) + \alpha\, \eta_{ij} X_{mn}\bigr]}, \tag{2.4}$$
where η_ij stands for the activity of neuron (i, j) in the settled activity bubble, w_{ij,mn} is the afferent or lateral connection weight (µ_{ij,r₁r₂}, E_{ij,kl}, or I_{ij,kl}), α is the learning rate for each type of connection (α_A for afferent weights, α_E for excitatory, and α_I for inhibitory), and X_mn is the presynaptic activity (ξ_{r₁,r₂} for afferent, η_kl for lateral). Afferent inputs, lateral excitatory inputs, and lateral inhibitory inputs are normalized separately. The larger the product of the pre- and postsynaptic activities η_ij X_mn, the larger the weight change. Therefore, both excitatory and inhibitory connections strengthen by correlated activity; normalization then redistributes the changes so that the sum of each weight type for each neuron remains constant.
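The following sketch (our own simplification: dense weight matrices instead of the local anatomical receptive fields, 10 settling iterations, and parameter values taken from the simulation described in Figure 2 below) implements one training step, that is, equations 2.1 through 2.4.

```python
import numpy as np

def sigma(x, delta=0.1, beta=0.65):
    """Piecewise linear sigmoid of equation 2.2."""
    return np.clip((x - delta) / (beta - delta), 0.0, 1.0)

def train_step(xi, A, E, I, gamma_e=0.9, gamma_i=0.9, alpha=0.002,
               settle_steps=10):
    """One LISSOM step. xi: flattened retinal activation; A, E, I:
    afferent, lateral excitatory, and lateral inhibitory weight
    matrices, one row per neuron (dense here for brevity)."""
    afferent = A @ xi
    eta = sigma(afferent)                         # initial response, eq. 2.1
    for _ in range(settle_steps):                 # settling, eq. 2.3
        eta = sigma(afferent + gamma_e * (E @ eta) - gamma_i * (I @ eta))
    # normalized Hebbian update, each weight type separately (eq. 2.4)
    for W, pre in ((A, xi), (E, eta), (I, eta)):
        W += alpha * np.outer(eta, pre)
        W /= W.sum(axis=1, keepdims=True)
    return eta
```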
3 Self-Organization

Although the self-organizing mechanism outlined above is robust, the width of the anatomical receptive fields and how ordered they are strongly affect the outcome of the process. In a series of simulations, the conditions under which self-organization could take place were studied, using networks with various receptive field widths and varying degrees of initial order. In each case, all synaptic weights were initially random: a uniformly distributed random value between zero and one was assigned to all weights, and the total weight of each connection type of each neuron was normalized to 1.0.¹ Gaussian spots of "light" on the retina were used as input. At each presentation, the activation ξ_{r₁,r₂} of the receptor (r₁, r₂) was given by
$$\xi_{r_1 r_2} = \sum_{i=1}^{n} \exp\left(-\frac{(r_1 - x_i)^2 + (r_2 - y_i)^2}{a^2}\right), \tag{3.1}$$
where n is the number of spots, a² specifies the width of the gaussian, and the spot centers (x_i, y_i), 0 ≤ x_i, y_i < R, were chosen randomly. When the networks were trained with single light spots (n = 1), similar afferent and lateral connection structures developed in all cases. With more realistic input consisting of multiple light spots (n > 1), the networks with wide anatomical receptive fields and relatively small topographic scatter self-organized just as robustly. Figures 2 through 4 illustrate the self-organization of such a network. The initial rough pattern of afferent weights of each neuron evolved into a hill-shaped profile (see Figure 2). The afferent weight profiles of different neurons peaked over different parts of the retina, and their centers of gravity (calculated in retinal coordinates) formed a topographic map (see Figure 3). Networks with large topographic scatter compared to the RF size, however, failed to develop global order with multispot inputs. It is interesting to analyze why. With single light spots, the afferent weights of an active neuron always change toward a single, local input pattern. Eventually these weights become concentrated around a local area in the RF such that the global distribution of the centers best approximates the input distribution. However, when multiple spots occur in the receptive field at the same time, the weights change toward several different locations. If the scatter is large and the neighboring receptive fields do not overlap substantially, the inputs in the nonintersecting areas change the weights the most, and the receptive fields remain crossed. If the scatter is small and the overlap high, the input activity in the intersecting region becomes strong enough to drive the weights toward topographic order. The model therefore suggests that for the self-organization of afferent connections in the cortex to be purely activity driven, the anatomical receptive fields must be large compared to the amount of initial topographic scatter among their centers.²

¹ Various schemes for initializing afferent connections have been studied by other researchers. These schemes include tagging connections with chemical markers (Willshaw & von der Malsburg 1979) and establishing an initial topographic bias (Willshaw & von der Malsburg 1976; Goodhill 1993). Our focus is on receptive field width versus initial order, because this factor has turned out to be most crucial in determining the success of the self-organizing process in the model presented here.

² Even with large scatter, it is still possible to establish topographic order with additional mechanisms such as chemical markers (Willshaw & von der Malsburg 1979) or special initialization schemes (Willshaw & von der Malsburg 1976). Also, computationally there is no restriction on how large the receptive fields can be. They can cover the whole retina, and global order will develop because such receptive fields overlap a lot (i.e., completely). Of course, anatomically such full connectivity would not be realistic.
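A short sketch (our own illustration; the function name is ours) of the multispot input of equation 3.1, with R = 21 receptors, n = 3 spots, and a = 2.0 as in the simulation of Figure 2:

```python
import numpy as np

def retina_input(R=21, n=3, a=2.0, rng=None):
    """Sum of n gaussian light spots of width a at random centers."""
    rng = rng or np.random.default_rng()
    r1, r2 = np.mgrid[0:R, 0:R].astype(float)
    xi = np.zeros((R, R))
    for _ in range(n):
        x_i, y_i = rng.uniform(0.0, R, 2)   # spot center, 0 <= x_i, y_i < R
        xi += np.exp(-((r1 - x_i) ** 2 + (r2 - y_i) ** 2) / a ** 2)
    return xi
```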
(a)
Figure 2: Self-organization of the afferent input weights. The afferent weights of five neurons (located at the center and at the four corners of the network) are superimposed on the retinal surface in this figure. The retina had 21 × 21 receptors, and the receptive field radius was chosen to be 8. Therefore, neurons could have anywhere from 8 × 8 to 17 × 17 afferents depending on their distance from the network boundary. (a) Initial random weights. The anatomical RF centers were slightly scattered around their topographically ordered positions (uniformly, within a distance of 0.5 in retinal coordinates), and the weights were initialized randomly (as discussed in the text). There are four concentrated areas of weights slightly displaced from the corners and one larger one in the middle. At the corners, the profiles are taller because the normalization divides the total afferent weight among a smaller number of connections. Continued.
The lateral connections evolve together with the afferents. By the normalized Hebbian rule (see equation 2.4), the lateral connection weights of each neuron are distributed according to how well the neuron’s activity correlates with the activities of the other neurons. As the afferent receptive fields organize into a uniform map (see Figure 3), these correlations fall ally there is no restriction on how large the receptive fields can be. They can cover the whole retina, and global order will develop because such receptive fields overlap a lot (i.e., completely). Of course, anatomically such full connectivity would not be realistic.
Figure 2: Continued. (b) Final organized receptive fields. As the self-organization progresses, the weights organize into smooth hill-shaped profiles. In this simulation, each input consisted of three randomly located gaussian spots with a = 2.0. The lateral interaction strengths were γe = γi = 0.9, the learning rates αA = αE = αI = 0.002, and the upper and lower thresholds of the sigmoid 0.65 and 0.1. The map was formed in 10,000 training presentations.
The lateral excitatory and inhibitory connections acquire the gaussian shape, and the combined lateral excitation and inhibition becomes an approximate difference of gaussians (or a "Mexican hat"; see Figure 4).

4 Ocular Dominance and Lateral Connection Patterns

The activity patterns and correlations in the retina are not as simple or random as in the above experiments. Also, the visual cortex has a complex organization of orientation and ocular dominance columns, and this organization is reflected in the patterns of lateral connections. As a first step to studying how afferent and lateral connections evolve in the visual cortex, the development of ocular dominance was simulated. A second retina was added to the model, and afferent connections were set up exactly as for the first retina (with local receptive fields and topographically ordered RF centers). Multiple gaussian spots were presented in each eye as input.
Figure 3: Self-organization of afferent receptive fields into a topographic map. The center of gravity of the afferent weight vector of each neuron in the 42 × 42 network is projected onto the receptive surface (represented by the square). Each center of gravity point is connected to those of the four immediately neighboring neurons by a line. The resulting grid depicts the topographical organization of the map. Initially, the anatomical RF centers were only slightly scattered topographically, but because the afferent weights were initially random, the centers of gravity are scattered much more (a). As the self-organization progresses, the map unfolds and the weight vectors spread out to form a regular topographic map of the receptive surface, as shown in (b). Because the anatomical receptive fields are large in this example compared to the initial topographic scatter, the self-organization is robust with both single-spot and multispot inputs.
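A minimal sketch of how the map in Figure 3 can be computed from the weights: each neuron's receptive field center is the center of gravity of its afferent weights in retinal coordinates. The array layout (`weights[i, j]` holding the afferent weights of neuron (i, j), with `coords[i, j]` the matching retinal positions) is a hypothetical convention, not the paper's data structure.

```python
import numpy as np

def center_of_gravity_map(weights, coords):
    # weights[i, j]: afferent weight vector of neuron (i, j), shape (n_aff,)
    # coords[i, j]:  retinal (x, y) position of each afferent, shape (n_aff, 2)
    n1, n2 = weights.shape[:2]
    centers = np.zeros((n1, n2, 2))
    for i in range(n1):
        for j in range(n2):
            w = weights[i, j]
            centers[i, j] = (w[:, None] * coords[i, j]).sum(axis=0) / w.sum()
    return centers  # connect neighboring centers with lines to draw the grid
```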
Because of cooperation and competition between inputs from the two eyes, groups of neurons developed strong afferent connections to one eye or the other, resulting in patterns of ocular dominance in the network (cf. von der Malsburg 1990; Miller et al. 1989). The self-organization of the network was studied with varying between-eye correlations. At each input presentation, one spot is randomly placed at (x_i, y_i) in the left retina, and a second spot within a radius of c × R of (x_i, y_i) in the right retina. The parameter c ∈ [0, 1] specifies the spatial correlations between spots in the two retinas and can be adjusted to simulate different degrees of correlations between images in the two eyes. Multispot images can be generated by repeating the above step; the simulations that follow used two-spot images in each eye. Two different simulations are illustrated. In the first, there were no between-eye correlations in the input (c = 1), thus simulating strabismus, where prominent ocular dominance patterns are known to develop. In the second case, there were positive between-eye correlations (c < 1), simulating the case of normal binocular vision.
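The placement scheme can be sketched as follows. Drawing the right-eye spot uniformly from the disk of radius c × R and clipping it to the retina are assumptions about sampling details the text leaves open.

```python
import numpy as np

def two_eye_spot_centers(R=24, c=0.4, rng=None):
    # One spot center is drawn uniformly for the left retina; the right-eye
    # spot is placed uniformly within a disk of radius c * R around it and
    # clipped to the retina. c = 1: uncorrelated eyes (strabismus); small c:
    # closely matched images in the two eyes.
    if rng is None:
        rng = np.random.default_rng(0)
    left = rng.uniform(0.0, R, size=2)
    angle = rng.uniform(0.0, 2.0 * np.pi)
    radius = c * R * np.sqrt(rng.uniform())  # uniform density over the disk
    right = left + radius * np.array([np.cos(angle), np.sin(angle)])
    return left, np.clip(right, 0.0, R)
```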
Figure 4: Self-organization of the lateral interaction. The lateral interaction profile for a neuron at position (20, 20) in the 42 × 42 network is plotted. The excitation and inhibition weights are initially randomly distributed within radii 3 and 18. The combined interaction, the sum of the excitatory and inhibitory weights, illustrates the total effect of the lateral connections. The sums of excitation and inhibition were chosen to be equal, but because there are fewer excitatory connections, the interaction has the shape of a rough plateau with a central peak (a). Continued.
Figure 5 illustrates the response properties and lateral connectivity in the strabismic case. The connection patterns closely follow ocular dominance organization. As neurons become better tuned to one eye or the other, activity correlations between regions tuned to the same eye become stronger, and correlations between opposite eye areas become weaker. As a result, monocular neurons develop strong lateral connections to regions with the same eye preference and weak connections to regions of opposite eye preference. Most neurons are monocular, but a few binocular neurons occur at the boundaries of ocular dominance regions. Because they are equally tuned to the two eyes, the binocular neurons have activity correlations with both ocular dominance regions, and their lateral connection weights are distributed more or less equally between them.
Figure 4: Continued. During self-organization, smooth patterns of excitatory and inhibitory weights evolve, resulting in a smooth Mexican hat–shaped lateral interaction profile (b).
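The combined interaction described above can be illustrated with a short sketch: a narrow gaussian of excitation minus a broad gaussian of inhibition, each component normalized to equal total weight as in the model, so the narrower excitation dominates at the center. The widths used are illustrative values, not parameters from the paper.

```python
import numpy as np

def mexican_hat(radius=20, sigma_e=1.5, sigma_i=6.0, gamma_e=0.9, gamma_i=0.9):
    # Difference of gaussians: because excitation is narrower but carries the
    # same total weight, the profile is positive at the center and negative
    # at mid-range (a "Mexican hat").
    d = np.arange(-radius, radius + 1, dtype=float)
    exc = np.exp(-d ** 2 / (2 * sigma_e ** 2))
    inh = np.exp(-d ** 2 / (2 * sigma_i ** 2))
    return gamma_e * exc / exc.sum() - gamma_i * inh / inh.sum()

profile = mexican_hat()
assert profile[len(profile) // 2] > 0 > profile.min()  # peak center, negative surround
```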
Figure 6 shows the ocular dominance and lateral connection patterns for the normal case. The ocular dominance stripes are narrower, and there are more ocular dominance columns in the network. Most neurons are neither purely monocular nor purely binocular, and few cells have extreme values of ocular dominance. Accordingly, the lateral connectivity in the network is only partially determined by ocular dominance. However, the lateral connections of the few strongly monocular neurons follow the ocular dominance patterns as in the strabismic case. In both cases, the spacing between the lateral connection clusters matches the stripe width. The patterns of lateral connections and ocular dominance outlined above closely match observations in the primary visual cortex. Löwel and Singer (1992) observed that when between-eye correlations were abolished in kittens by surgically induced strabismus, long-range lateral connections primarily linked areas of the same ocular dominance. However, binocular neurons, located between ocular dominance columns, retained connections to both eye regions. The ocular dominance stripes in the strabismics were broad and sharply defined (Löwel 1994).
Figure 5: Ocular dominance and patterned long-range lateral connections in the strabismic case. Each neuron is colored with a gray-scale value (black → white) that represents continuously changing eye preference from exclusive left, through binocular, to exclusive right. Most neurons are monocular, so white and black predominate. Small white dots indicate the strongest lateral input connections to the neuron marked with a big white dot. Only the long-range inhibitory connections are shown. The excitatory connections link each neuron only to itself and its eight nearest neighbors. (a) The lateral connections of a left monocular neuron predominantly link areas of the same ocular dominance. (b) The lateral connections of a binocular neuron come from both eye regions. In this simulation, the parameters were: network size, N = 64; retinal size, R = 24; afferent field size, s = 9; δ = 0.1; β = 0.65; spot width, a = 5.0; excitation radius, d = 1; inhibition radius = 31; scaling factors, γe = 0.5, γi = 0.9; learning rates, αA = αE = αI = 0.002; number of training iterations = 35,000. The anatomical RF centers were slightly scattered around their topographically ordered positions (radius of scatter = 0.5, as in Figure 2), and all connections were initialized to random weights.
In contrast, ocular dominance stripes in normal animals were narrow and less sharply defined, and lateral connection patterns overall were not significantly influenced by ocular dominance. The receptive field model reproduces these experimental results and also predicts that the lateral connections of strongly monocular neurons would follow ocular dominance even in the normal case. The model therefore confirms that patterned lateral connections develop based on correlated neuronal activity and demonstrates that they can self-organize cooperatively with ocular dominance columns.
Figure 6: Ocular dominance and patterned long-range lateral connections in the normal case. The simulation was otherwise identical to that of Figure 5, except that between-eye correlations were greater than zero (c = 0.4). The stripes are narrower, and most neurons have intermediate values of ocular dominance (colored gray). Their lateral connections are only partially influenced by ocular dominance, as in (a). However, as shown in (b), the lateral connections of the most strongly monocular neurons reflect the ocular dominance organization as in the strabismic case. If the between-eye correlations are increased to one (perfectly matched spots in the two retinas), ocular dominance properties will disappear, and simple topographic maps and Mexican hat lateral interaction will result, as in Figures 2 through 4.
5 Computational Predictions of the Model

RF-LISSOM is a computational model of known physiological phenomena, and it can be used to identify computational constraints that must be met for these phenomena to occur. Below, the adaptation and extent of lateral excitation and inhibition, and correlation versus lateral excitation, are discussed as possible such constraints. Other computational details are included in the appendix.

5.1 Adapting Lateral Excitation and Inhibition. For proper topographic order to develop in our model, the initial lateral excitatory radius must be large enough that the network produces a single localized spot of activity for each input spot. Otherwise spurious correlations are introduced into the self-organizing process, and the map is likely to develop twists. A good general guideline is that the initial radius of excitation should be comparable to the range of activity correlations in the network.
However, the excitatory radius need not be fixed. As in the self-organizing map algorithm (Kohonen 1982), the radius may start out large and gradually decrease until it covers the nearest neighbors only. Decreasing the radius in this manner makes the receptive fields gradually narrower and results in finer topographic order. This is a very robust effect and has been observed in numerous experiments. Even at a fixed excitatory radius, the lateral weights adapt, with an effect similar to that of adapting the excitatory radius. If the input correlations are short range, the lateral excitatory profile will become more peaked, and the inhibition will become stronger in the vicinity of each neuron (see Figure 4). Therefore, the activity patterns in the network will become smaller and more focused, as happens when the excitatory radius decreases. As a result, the receptive fields will become narrower. For example, small gaussian input blobs will produce both short-range activity correlations in the network and smaller receptive fields. The range of lateral inhibition is not a crucial parameter as long as it is greater than approximately twice the excitatory radius. If the range is less, activity patterns often split up into several separate bubbles, and distorted maps form. The weak long-range connections may be repeatedly pruned away during self-organization, as in the LISSOM model (Sirosh & Miikkulainen 1994). By weight normalization, such pruning will increase the strength of the surviving connections, thereby increasing the inhibition between the correlated areas. As a result, the activity patterns will become better focused, and more specific maps will develop.

5.2 Long-range Inhibition versus Excitation. For the self-organization of receptive fields to occur, our model requires that long-range lateral interactions be inhibitory. This is an important prediction of the model. The inhibitory interactions prevent the neural activity from spreading and saturating. As the lateral connections adapt, it is necessary that the lateral inhibition be strengthened by correlated activity. If it were weakened instead, inhibition would concentrate on those units that are the least active, in effect self-organizing itself out of the system. The activity patterns would then spread out and saturate, and highly distorted maps would result. Anatomical studies in the visual cortex show that long-range horizontal connections have mostly excitatory synapses and link primarily to other excitatory neurons (Kisvarday & Eysel 1992; Hirsch & Gilbert 1991; Gilbert et al. 1990). The crucial question is whether afferent stimulation produces mainly excitatory or inhibitory lateral responses at long range. Partial experimental support for long-range inhibition comes from optical imaging studies, which indicate that substantial lateral inhibition exists even at distances of up to 6 mm in the primary visual cortex (Grinvald et al. 1994). This range coincides well with the extent of long-range horizontal connections. Other studies, such as that of Nelson and Frost (1978), have also reported long-range inhibitory interactions without clearly identifying the anatomical substrate.
Due to insufficient evidence, however, the question of whether the primary effect is inhibitory or excitatory remains controversial. Until this issue is resolved and the underlying cortical circuitry is delineated, it will be difficult to determine if and how lateral inhibition in the cortex strengthens by correlated activity, as predicted by the model.

5.3 Factors Affecting the Development of Ocular Dominance. Four main factors influence how ocular dominance columns form: (1) the network size N, (2) the retina size R, (3) the excitation radius d, and (4) the between-eye correlation parameter c. For given values of the network parameters N, R, and d, the degree of between-eye correlation determines whether ocular dominance columns form. In simulations with small values of c (closely matched images in both eyes), only topographic maps develop. As c is gradually increased from zero, a sharp transition from a simple topographic map to an ocular dominance map occurs at a threshold ct ∈ (0, 1), and stable ocular dominance patterns appear. For example, in the simulations illustrated in Figures 5 and 6, N = 64, R = 24, and d = 1, and the threshold at which ocular dominance columns first appear is approximately ct = 0.25. As the network parameters change, ct appears to change in a fairly simple manner. The threshold is approximately the same for all networks for which the ratio dR/N, or the excitatory radius in retinal units, is the same. As this ratio increases (as when the excitation radius increases or the network becomes smaller), the relative size of activity bubbles on the network becomes larger, increasing correlation and causing the transition to ocular dominance to occur at higher levels of c. This is analogous to the self-organizing maps (Kohonen 1982), where an input dimension (i.e., ocular dominance) becomes represented on the map only when the variance in that dimension (c) grows large enough compared to the neighborhood radius (dR/N). Whether the lateral connections adapt is not important for the development of ocular dominance. Well-defined stripes can form even if the lateral weights are fixed, although such stripes are broader than those developed with self-organizing lateral connections. Similarly, ocular dominance columns form under a variety of boundary conditions. With self-organized lateral connections, they form at least with sharp and periodic retinal and network boundaries and with smaller or same-size anatomical RFs at the boundary, although the patterns of the stripes may be qualitatively different in each case.

6 Future Work

Perceptual grouping rules such as continuity of contours, proximity, and smoothness appear as long-range activity correlations in the cortex. Von der Malsburg and Singer (1988) and Singer et al. (1990) have suggested that lateral connections store these correlations and perform perceptual grouping and segmentation by synchronizing cortical activity. The receptive field LISSOM model could be extended to test this hypothesis computationally.
So far, we have demonstrated how the inhibitory lateral connections of LISSOM learn to store long-range activity correlations. On the other hand, models such as that of von der Malsburg and Buhmann (1992) show that neuronal firing can be synchronized quite effectively by inhibitory lateral interactions. These two results could be brought together into a LISSOM architecture using neurons with realistic firing properties and used to study segmentation and binding phenomena. In future research, we intend to work on self-organization in networks with firing neurons, study the knowledge extracted by lateral connections from realistic input, and determine how this knowledge could be used in feature grouping and segmentation.

7 Conclusion

The receptive field LISSOM model demonstrates that cortical self-organization can be based on short-range excitation, long-range inhibition, and Hebbian weight adaptation. Starting from random-strength connections, neurons develop localized receptive fields and lateral interaction profiles cooperatively and simultaneously. Self-organization takes place with multiple retinal inputs and with multiple anatomical receptive fields of varying sizes, but requires that the receptive fields be large compared to the degree of scatter among their centers. With inputs coming from two retinas at the same time, ocular dominance stripes develop, and lateral connections primarily connect areas of the same ocular dominance. When input correlations between the eyes increase, the ocular dominance stripes become narrower, and the influence of ocular dominance on the lateral connection patterns decreases. The model therefore suggests that lateral connections have a dual role in the cortex: to support self-organization of receptive fields and to learn long-range feature correlations, which may serve as a basis for perceptual grouping and segmentation.

Appendix: Simulation Parameters

The main parameters that influence self-organization are the sigmoid activation function and the strengths of the recurrent lateral excitation and inhibition, γe and γi. The activation function (see equation 2.2) affects the self-organizing process in two ways. First, the lower threshold δ determines the minimum input required for a neuron to participate in weight modification. If the weighted sum is below δ, the output activity ηij = 0, and the weights will not be modified. Only input vectors close enough to the neuron's weight vector cause it to adapt; the higher the δ, the smaller the area of the input space stimulating the neuron. Second, for a given δ, the upper threshold β determines the gain of the neuron activation. If β is close to δ, the neuron is more nonlinear, and small differences in its input can produce large changes in its output.
When β is lower, the map amplifies smaller differences in the activity pattern. Therefore, the selectivity of receptive fields can be controlled by adjusting δ and β. Good maps are usually obtained for values of δ from 0.0 to 0.2 and β from 0.55 to 0.8. The actual parameters used in each simulation are shown in the figure captions. For a given sigmoid, if the lateral excitation is too strong (high value of γe), saturated activity spots form, and distorted maps result. If γe is too small or γi is too large, the activity spots do not become sufficiently focused, and topographic maps do not develop. When the sigmoid slope is in the range 1.0 to 2.0, the excitation radius is d = 4, and the inhibition radius is greater than 2d, the operating ranges are typically 0.8 ≤ γe ≤ 1.1 and 0.7 ≤ γi ≤ 1.4. If the excitation radius is decreased, individual weights will grow stronger because the total lateral excitation weight of each neuron is normalized to 1.0. As a result, the excitation may become too strong. To offset this problem, the value of γe will have to decrease with the excitation radius. At the excitation radius of d = 1, the operating range of γe was 0.4 ≤ γe ≤ 0.65.

Acknowledgments

This research was supported in part by the National Science Foundation under grant IRI-9309273. Computing time for the simulations was provided by the Pittsburgh Supercomputing Center under grants IRI930005P and TRA940029P.
References

Amari, S.-I. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364.
Burkhalter, A., Bernardo, K. L., & Charles, V. (1993). Development of local circuits in human visual cortex. Journal of Neuroscience, 13, 1916–1931.
Dalva, M. B., & Katz, L. C. (1994). Rearrangements of synaptic connections in visual cortex revealed by laser photostimulation. Science, 265, 255–258.
Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9, 1–13.
Gilbert, C. D., Hirsch, J. A., & Wiesel, T. N. (1990). Lateral interactions in visual cortex. In Cold Spring Harbor Symposia on Quantitative Biology, 55, 663–677. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Goodhill, G. (1993). Topography and ocular dominance: A model exploring positive correlations. Biological Cybernetics, 69, 109–118.
Grinvald, A., Lieke, E. E., Frostig, R. D., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. Journal of Neuroscience, 14, 2545–2568.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. Journal of Neuroscience, 11, 1800–1809.
Hubel, D. H., & Wiesel, T. N. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. Journal of Neurophysiology, 28, 229–289.
Katz, L. C., & Callaway, E. M. (1992). Development of local circuits in mammalian visual cortex. Annual Review of Neuroscience, 15, 31–56.
Kisvarday, Z. F., & Eysel, U. T. (1992). Cellular organization of reciprocal patchy networks in layer III of cat visual cortex (area 17). Neuroscience, 46, 275–286.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1993). Physiological interpretation of the self-organizing map algorithm. Neural Networks, 6, 895–905.
Löwel, S. (1994). Ocular dominance column development: Strabismus changes the spacing of adjacent columns in cat visual cortex. Journal of Neuroscience, 14(12), 7451–7468.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
Miikkulainen, R. (1991). Self-organizing process based on lateral inhibition and synaptic resource redistribution. In Proceedings of the International Conference on Artificial Neural Networks (Espoo, Finland) (pp. 415–420). Amsterdam and New York: North-Holland.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Nelson, J., & Frost, B. (1978). Orientation-selective inhibition from beyond the classical visual receptive field. Brain Research, 139, 359–365.
Singer, W., Gray, C., Engel, A., König, P., Artola, A., & Bröcher, S. (1990). Formation of cortical cell assemblies. In Cold Spring Harbor Symposia on Quantitative Biology, 55, 939–952. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory.
Sirosh, J., & Miikkulainen, R. (1993). How lateral interaction develops in a self-organizing feature map. In Proceedings of the IEEE International Conference on Neural Networks (San Francisco, CA) (pp. 1360–1365). Piscataway, NJ: IEEE.
Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71(1), 66–78.
Stryker, M., Allman, J., Blakemore, C., Gruel, J., Kaas, J., Merzenich, M., Rakic, P., Singer, W., Stent, G., van der Loos, H., & Wiesel, T. (1988). Group report: Principles of cortical self-organization. In P. Rakic & W. Singer (Eds.), Neurobiology of neocortex (pp. 115–136). New York: Wiley.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15, 85–100.
von der Malsburg, C. (1990). Network self-organization. In S. F. Zornetzer, J. L. Davis, & C. Lau (Eds.), An introduction to neural and electronic networks (pp. 421–432). New York: Academic Press.
von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biological Cybernetics, 67, 233–242.
von der Malsburg, C., & Singer, W. (1988). Principles of cortical network organization. In P. Rakic & W. Singer (Eds.), Neurobiology of neocortex (pp. 69–99). New York: Wiley.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194, 431–445.
Willshaw, D. J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philosophical Transactions of the Royal Society of London B, 287, 203–243.
Received January 10, 1994; accepted May 2, 1996.
Communicated by Ralph Linsker
The Formation of Topographic Maps That Maximize the Average Mutual Information of the Output Responses to Noiseless Input Signals

Marc M. Van Hulle
K.U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Leuven, Belgium
This article introduces an extremely simple and local learning rule for topographic map formation. The rule, called the maximum entropy learning rule (MER), maximizes the unconditional entropy of the map’s output for any type of input distribution. The aim of this article is to show that MER is a viable strategy for building topographic maps that maximize the average mutual information of the output responses to noiseless input signals when only input noise and noise-added input signals are available.
1 Introduction

Ralph Linsker was among the first to introduce information-theoretic principles into the field of self-organizing neural networks. He devised a global learning rule for topographic map formation by maximizing the average mutual information (Shannon information rate) between the output and the signal part of the input, which is corrupted by noise (Linsker 1989). He argued that mutual information maximization is an optimal way to extract statistically salient input features. Others have devised learning rules from computational principles that are different from yet related to the former: (1) principal component analysis (Földiák 1989; Rubner & Schulten 1990; Sanger 1989), and its nonlinear extension, which is modeled by Kohonen's self-organizing (feature) map (SOM) algorithm (Ritter et al. 1992), as a way to account for most of the input's variance but not for the noise, and (2) smoothing and predictive filtering (Atick & Redlich 1991), for which the limiting case approximates mutual information maximization. Linsker also introduced a two-stage, local learning rule for maximizing the average mutual information when the input distribution is multivariate gaussian (Linsker 1992). The aim of this article is to introduce a simple learning strategy for building topographic maps that maximize the average mutual information of the map outputs, in response to noiseless inputs, by using input noise and noise-added input signals only. The application envisaged is density estimation.
2 Maximum Entropy Learning Rule

The aim of topographic map formation is to develop, in an unsupervised way, a mapping from a higher d-dimensional space V of input signals onto an equal- or lower-dimensional discrete lattice A of N formal neurons. To each formal neuron i ∈ A corresponds a unique weight vector wi = [wi1, . . . , wid] defined in V-space. As a result of this mapping, the formal neurons quantize the input space V, with probability density function (p.d.f.) p(v), into a discrete number of partition cells or quantization regions. The definition of a quantization region is, in this case, different from the one used in a Voronoi (Dirichlet) tessellation. Assume that V and A have the same dimensionality d and that A comprises a periodic and regular d-dimensional lattice with a rectangular topology. Since the topology is rectangular, the lattice consists of a number of d-dimensional quadrilaterals (see Figure 1A). For example, in Figure 1B, the quadrilateral labeled He is defined by the neurons i, j, k, and m: the weights of neurons i, j, k, and m delimit a region in V-space. Hence, each quadrilateral represents a "granular" quantization region in V-space. We also consider the "imaginary" quadrilaterals facing the outer border of the lattice that represent the "overload" quantization regions (see Figure 1B). To each of the Q quadrilaterals of A, we associate a code membership function:

$$\mathbb{1}_{H_j}(v) = \begin{cases} \dfrac{2^d}{n_{H_j}} & \text{if } v \in H_j \\ 0 & \text{if } v \notin H_j, \end{cases} \qquad (2.1)$$
with $n_{H_j}$, $1 \le n_{H_j} \le 2^d$, the number of vertices of $H_j$. We assume that p(v) is stationary and ergodic and that the probability that v falls on one of the links equals zero: hence, a single input will activate two or more quadrilaterals only when these quadrilaterals overlap. (Note that several quadrilaterals may be active at the same time when the lattice is tangled.) Define Si as the set of $2^d$ quadrilaterals that have neuron i as a common vertex (e.g., for neuron j in Figure 1B, Sj comprises Hh, Hi, He, Hf). Neuron i is said to be active if at least one of the quadrilaterals of the set Si is active (i.e., with nonzero code membership function output). The d-dimensional maximum entropy learning rule (MER) is defined as

$$\Delta w_i = \eta \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v)\, \mathrm{Sgn}(v - w_i), \qquad \forall i \in A, \qquad (2.2)$$

with η the learning rate (a positive constant) and Sgn(·) the sign function acting componentwise. The set Si guarantees that only the vertices of the active quadrilaterals are updated. In appendix A, we show that the average of equation 2.2 converges since it performs stochastic gradient descent on a positive definite cost function E (i.e., a Lyapunov function).
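As a concrete illustration, here is a minimal sketch of MER in one dimension, where equation 2.2 reduces to the update of equation B.1 in appendix B: the active interval pulls its two delimiting weights toward the input. The initialization and the sampled distribution are illustrative assumptions.

```python
import numpy as np

def mer_1d(samples, N=17, eta=1e-4, w_init=None):
    # One-dimensional MER (equation 2.2). Overload intervals beyond w_1 and
    # w_N have a single vertex, so their membership value is 2^d / n_H = 2
    # (equation 2.1). Weights are assumed to stay ordered (kinks are
    # unstable; see Proposition 4 in appendix A).
    if w_init is None:
        w = np.sort(np.random.default_rng(0).uniform(-1.0, 1.0, N))
    else:
        w = np.asarray(w_init, dtype=float).copy()
    for v in samples:
        k = np.searchsorted(w, v)          # v lies in interval H_k, k = 0..N
        val = 2.0 if k in (0, N) else 1.0  # overload intervals count double
        if k > 0:
            w[k - 1] += eta * val          # v is to the right of w_{k-1}
        if k < N:
            w[k] -= eta * val              # v is to the left of w_k
    return w

rng = np.random.default_rng(1)
w = mer_1d(rng.normal(0.0, 0.5, 200_000))  # approaches equiprobable quantiles
```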
Figure 1: Definition of quantization region. (A) Portion of a two-dimensional lattice A with a rectangular topology, represented in V-space. To every line crossing (vertex) corresponds a neuron weight vector. The bold line shows the outer border of the lattice. (B) The same portion of the lattice but with some of the neurons and quadrilaterals labeled. The quadrilaterals inside the lattice, labeled He , H f , Hh , and Hi , are the granular quantization regions; those that are labeled Ha , Hb , Hc , Hd , and Hg are the overload regions, and they are constructed by extending the links connecting the neuron on the border with its immediate neighbors (dashed lines).
We know that the unconditional entropy is maximal when each of the N neurons is active with equal probability. In appendix A we show that for the one-dimensional case, MER is guaranteed to yield a unique and equiprobable activity distribution: P(1) = · · · = P(i) = · · · = P(N) = 1/N, with P(i) the probability that neuron i is active; for the general d-dimensional case we show that the output (weight) density is proportional to the input density for large N (i.e., an equiprobable quantization). Furthermore, also in appendix A, we show that a tangled lattice cannot be a stable configuration in the d-dimensional (as well as in the one-dimensional) case.

3 Mutual Information Maximization

From the previous section we know that MER will yield a topographic map that maximizes the unconditional entropy of its N binary outputs, given the input distribution p(v). The binary outputs result from quantizing the input signal v (i.e., the $\mathbb{1}_{H_j}$'s). Define F(y, x) as the mutual information between the network's binary output y and the network's desired binary output x in response to noiseless input signals. The mutual information can be written as

$$F(y, x) = I(y) - I(y \mid x), \qquad (3.1)$$
with I(y) the unconditional entropy of y and I(y | x) the conditional entropy of y, given x. By maximizing F, we maximize the correspondence between the y and x vectors. Assume that we have at our disposal two types of input signals: the input signal v, which is the sum of a clean input signal s and input noise n (additive noise), and the input noise signal n. Define further ps as the clean signal distribution, pn as the noise distribution, and psn as the distribution of v, that is, the input signal distribution in the presence of additive noise. Linsker (1992) suggested maximizing the mutual information between the signal portion of the noisy input (additive noise) and the output response to the noisy input, and introduced a learning strategy in which gradient ascent was performed on F directly, with a negative learning rate for I(y | x). Here we will propose a different, much simpler learning strategy for maximizing our definition of average mutual information. We can rewrite equation 3.1 as F(y, x) = I(x) − I(x | y), with I(x) the unconditional entropy of x, and with I(x | y) the average uncertainty about x when we know y (equivocation). Hence, a y vector does not contribute to I(x | y) when its activation region coincides with that of an x vector, or when it is a subset thereof. Since I(x | y) is expected to decrease as the distribution of y vectors approaches the desired distribution of x vectors, the mutual information F will be maximal when I(x | y) is minimal. (Note that max(F) = max(I(x)) = log N.) Hence, F is a function of the binary output vectors y only, and the problem reduces to one of determining the topographic map that maximizes the unconditional entropy of the distribution of output responses to ps. This will be achieved in the following way. We use MER to derive two maps: one that maximizes the unconditional entropy of the map's output corresponding to psn and another that maximizes it for pn. The purpose of this entropy maximization is to derive nonparametric models of both input distributions: since entropy maximization leads to an equiprobable quantization and since, by the latter, the weight distribution closely approximates the input signal distribution for large N, MER can be used for reconstructing the shape of psn and pn (e.g., using splines). (Note that by the separation of density estimation from density reconstruction, such an approach is different from other adaptive nonparametric density estimation approaches, such as adaptive kernel- and nearest-neighbor density estimation; see Silverman 1986.) Once psn and pn are modeled in this way, we can infer the clean signal distribution: since the noise is additive, psn equals the convolution of ps and pn. Hence, in principle, we can obtain the desired clean signal distribution ps by deconvolution. We can then reapply MER to determine the weight vectors that maximize the unconditional entropy of the distribution of output responses to ps and, thus, that maximize F. For the sake of exposition, we have used in the simulations a procedure that is even simpler than deconvolution. Assume that both input distributions are univariate gaussians and that N is odd. Hence, at convergence, the averaged weights wmid, with mid = (N − 1)/2 + 1, represent also the means of the corresponding distributions.
The weights for ps are inferred from those for psn and pn in the following way: $w_j^s = \sqrt{1 - \rho^2}\, w_j^{sn}$, with $\rho = (w_j^n - w_{mid}^n)/(w_j^{sn} - w_{mid}^{sn})$, ∀j, j ≠ mid. Evidently this leads to the correct solution when psn and pn are gaussians; for other unimodal, symmetric distributions with small higher-than-second-order moments, it is an approximation. Finally, we note that the estimation of psn and pn is a common procedure in model-based noise-reduction schemes. For example, in speech applications, one obtains estimates of (environmental) noise when no speech signal is present. Knowledge of the underlying input distributions greatly improves the quality of, for example, blind signal identification (Pajunen et al. 1996), speech enhancement (Xie & Van Compernolle 1994), and speech recognition (Rabiner & Juang 1993), and it is also useful for hybrid systems (based on mutual information maximization; see Rigoll 1994), adaptive waveform coding (Martinez & Yang 1995), and adaptive channel equalization (Adali et al. 1995).

4 Simulations

We run two networks, one for psn and another for pn, until convergence and determine the weights of ps, as explained in the previous section. We then compute numerically the neural activations for ps given these weights. Finally, we display the performance of our MER-based procedure. Since entropy is not a very sensitive metric, due to its logarithmic nature, we plot the mean squared error (MSE) between the present neural activations for ps and the desired equiprobable ones. For MER, we use the fast version described in appendix B; the learning rate is η = 0.0001. We will give two examples. First, we consider the standard case where the samples from ps and pn are randomly and independently drawn from zero-mean gaussians with standard deviations σs and σn, respectively. The result is shown in Figure 2A for σs = 0.5 and different values of σn, and for N = 17 and 33. We observe that the MER-based procedure yields a rather accurate estimate of ps up to σn = 0.5 (a signal-to-noise ratio, SNR, of 0 dB). Second, in order to show that our MER-based procedure also works for nongaussian distributions, we consider the case where ps is a zero-mean symmetric product distribution and pn a zero-mean uniform distribution. The symmetric product distribution is a strongly peaked p.d.f. that is generated by taking the product of two random numbers that are uniformly distributed within the ranges [0, 1/√2) or (−1/√2, 0]; both ranges are equally probable. The range of the product distribution is (−1/2, 1/2) and that of the uniform distribution is [−a, a). The product distribution falls off log-hyperbolically from the center. The result is shown in Figure 2B. We observe that the MER-based procedure yields a significant improvement at least up to a = 0.5 (SNR = −4.8 dB). In order to verify the effectiveness of our "gaussian" approximation, we have applied a deconvolution procedure in which MER is used for estimating psn and pn (the deconvolution procedure is detailed in appendix C).
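A sketch of the gaussian shortcut used in place of deconvolution, assuming N odd, zero-mean unimodal distributions, and converged (equiprobable) MER weights for the noisy signal and for the noise. The clamping of ρ² at 1 is a numerical safeguard added here, not part of the paper.

```python
import numpy as np

def infer_clean_weights(w_sn, w_n):
    # w_j^s = sqrt(1 - rho^2) * w_j^sn, with
    # rho = (w_j^n - w_mid^n) / (w_j^sn - w_mid^sn), for all j != mid.
    N = len(w_sn)
    mid = (N - 1) // 2        # 0-based index of the middle weight
    w_s = np.asarray(w_sn, dtype=float).copy()
    for j in range(N):
        if j == mid:
            continue
        rho = (w_n[j] - w_n[mid]) / (w_sn[j] - w_sn[mid])
        w_s[j] = np.sqrt(max(0.0, 1.0 - rho ** 2)) * w_sn[j]  # clamp: safeguard
    return w_s
```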
Figure 2: Mutual information maximization performance of the MER-based procedure. The performance is expressed as the MSE between the neural activations for the clean signal distribution and the desired, equiprobable ones. The thin and thick solid lines show the MSE result for N = 17 and 33, respectively. The thin and thick dashed lines show the corresponding MSE results in case only psn is maximized. This serves as a reference against which the performance of the MER-based procedure is to be compared. (A) Result in case of zero-mean gaussians for the clean and the noise signal with standard deviations σs and σn, respectively; σs = 0.5. (B) Result in case of the symmetric product distribution with range (−1/2, 1/2) for the clean signal and the uniform distribution with range [−a, a) for the noise signal; σs ≈ 0.166 and σn = a/√3. The fourth moment of the clean signal is ≈ 2.5 × 10⁻³. The thin and thick dot-dash lines denote the results obtained in case of deconvolution for N = 17 and 33, respectively.
However, we observe that deconvolution yields inferior results, at least for the range of a shown (dot-dash lines). The inferior performance is due to the presence of the sharp peak in ps, which makes the latter difficult to reconstruct. We also observe that the procedure breaks down below a = 0.2. Finally, we note that the I(y) values obtained in our simulations were typically within 2% of the theoretically attainable range, although the weights could be significantly different from the optimal ones. Hence, we feel that performing gradient ascent on I(y) directly is not recommended, owing to its logarithmic nature, especially when the (decrease in the) progress made in I(y) is used in the stopping criterion.

Appendix A: Properties of MER

Proposition 1.
For a discrete and statistically stationary input probability density p(v), equation 2.2 on average (for small enough η) performs (stochastic) gradient descent on the cost function
$$E = \sum_{k=1}^{M} \sum_{i=1}^{N} \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v^k)\, |v^k - w_i|, \qquad (A.1)$$

with $\{v^k\}$ the set of input signals, and $|v^k - w_i| = \sum_{j=1}^{d} |v_j^k - w_{ij}|$ (i.e., the absolute "per-letter" distortion).
Proof. Following the definition of gradient descent, the weights are updated so as to reduce the cost term introduced by the active quantization regions (defined by the weight vectors of the previous iteration step). Hence, since equation A.1 is continuous and piecewise differentiable, we obtain

$$\langle \Delta w_i \rangle_V = -\eta \frac{\partial E}{\partial w_i} = \eta \sum_{k=1}^{M} \sum_{H_j \in S_i} \mathbb{1}_{H_j}(v^k)\, \mathrm{Sgn}(v^k - w_i), \qquad \forall i \in A, \qquad (A.2)$$

and the latter exactly corresponds to the right-hand side of equation 2.2 summed over all input patterns. Since E is always positive definite and its partial derivatives (equation A.2) are continuous, equation A.1 is a Lyapunov function.

For the case where p(v) is continuous, we modify the parallel update scheme of MER to a sequential one: (1) one quadrilateral out of the set of active quadrilaterals is chosen randomly, and (2) one vertex out of the vertices of the randomly chosen quadrilateral is updated using MER.

Proposition 2. For a continuous and statistically stationary input p.d.f. p(v) and a sequential update scheme, a Lyapunov function exists for averaged MER.

Proof. The derivatives of the average learning rule comprise a matrix $H^E = [H^E_{ij}]$, $H^E_{ij} = \partial \langle \Delta w_i \rangle_V / \partial w_j$, with components $H^E_{ij} = 0$ for $i \ne j$, and $H^E_{ii} \ne 0$. Since the matrix $H^E$ is symmetric, it is a Hessian. Hence, if and only if $H^E$ is a Hessian, a Lyapunov function exists.

Proposition 3. In the one-dimensional case, MER is guaranteed to yield a set of equiprobable neurons.

Proof. Consider a one-dimensional lattice with quantization intervals H1, . . . , HN+1 defined by the weights w1, . . . , wN. Due to the existence of a Lyapunov function, we know that MER will converge on average.
We have that, at convergence,

$$\langle \Delta w_i \rangle_V = 0 = \langle \mathbb{1}_{H_{i+1}} - \mathbb{1}_{H_i} \rangle_V = \frac{2\, P(H_{i+1})}{n_{H_{i+1}}} - \frac{2\, P(H_i)}{n_{H_i}}, \qquad i = 1, \ldots, N, \qquad (A.3)$$

with P(Hi) the probability that Hi is active (i.e., $P(\mathbb{1}_{H_i} \ne 0)$). Now since, by definition, neuron i is activated when either Hi or Hi+1 or both are activated, the probability that neuron i is activated becomes

$$P(i) = \frac{P(H_i)}{n_{H_i}} + \frac{P(H_{i+1})}{n_{H_{i+1}}}$$

(where we have normalized to keep $\sum_i P(i) = 1$). If we now substitute for the P(Hi)-equalities obtained from equation A.3, we obtain that P(1) = · · · = P(i) = · · · = P(N), and thus a set of equiprobable neurons, because $n_{H_i} = 2$ for $1 < i < N + 1$ and $n_{H_1} = n_{H_{N+1}} = 1$.

Corollary 1. In the one-dimensional case, the averaged neuron weights w1, . . . , wN at convergence are the medians of the intervals they delimit.

Proof.
Follows directly from the equiprobable quantization.
Define a kink at neuron i as the topological defect for which either wi > wi+1 and wi > wi−1, or wi < wi+1 and wi < wi−1.

Proposition 4. In the one-dimensional case, MER is guaranteed to converge on average to a one-dimensional lattice without kinks.

Proof by indirect demonstration. Assume the inverse: a one-dimensional lattice with at least one kink is a stable solution. Assume that there is a kink at neuron i; hence, the intervals Hi and Hi+1 overlap and are located at the same side of i. This signifies that neuron i will receive weight updates only in the direction of both overlapping intervals, and never in the opposite direction. By consequence, neuron i will keep on shifting until the overlap is removed. Hence, the kink at neuron i is not a stable configuration. Since kinks are the only topological defects of one-dimensional lattices, the latter two propositions also signify that MER converges to a unique set of equiprobable neurons: P(1) = · · · = P(i) = · · · = P(N) = 1/N.

Before we turn to the d-dimensional case, we need an additional definition. Since MER performs weight updates componentwise, we can divide the region within which neuron i can be activated into $2^d$ quadrants: for every input falling in the same quadrant, the magnitude and direction of the weight update vector are the same (cf. the sign functions in MER).
For the higher-than-one-dimensional case, one can easily verify that for each neuron i, the (averaged) weight vector wi at convergence represents the d-dimensional median of Si, on condition that the median is defined as the vector of the (scalar) medians in each of the d input dimensions separately. (Note that there exists no unique definition of the median in d dimensions.) The weights satisfying this condition are stable equilibrium points of our averaged learning rule. Furthermore, when all opposite quadrants of a neuron carry nonzero activation probabilities, these quadrants will be equiprobable.

Proposition 5. In the d-dimensional case, the output (weight) density λ(wi) at convergence is proportional to the input density p(v) when N grows large, given that the neurons' activation regions Si span nonzero volumes in V-space.

Proof. In the asymptotic situation where N is large, the discrete weight distribution can be expected to approximate a continuous density function λ having unit volume. Then λ ΔVol may be taken as the fraction of weight vectors (or neurons) located in an incremental volume element ΔVol. Thus, the volume of the quantization region Si associated with neuron i is given approximately by

$$\mathrm{Vol}(S_i) \approx \frac{1}{N\, \lambda(w_i)} \qquad (A.4)$$

for every bounded region Si (λ ≠ 0). (Note that the denominator is the number of points in the neighborhood of wi, so that the reciprocal is the volume per neuron.) For large N it is reasonable to assume that most of the regions Si will be bounded sets, and the contribution from the "overload" regions will be negligible. Furthermore, also for large N, we can make the approximation that p(v) ≈ p(wi), ∀v ∈ Si (given Vol(Si) ≠ 0). We then obtain, for the probability for neuron i to be active,

$$P(i) = \mathrm{Vol}(S_i)\, p(w_i) \approx \frac{p(w_i)}{N\, \lambda(w_i)}, \qquad (A.5)$$

after substitution of equation A.4. We can also rewrite the contribution made by neuron i to the cost function associated with MER as follows:

$$E_i = P(i) \int_{S_i} |v - w_i|\, dv = \frac{p(w_i)}{N\, \lambda(w_i)} \int_{S_i} |v - w_i|\, dv \qquad (A.6)$$

(or with a summation over Si instead of an integral in case of a discrete input distribution). We know that E is a Lyapunov function; hence, at convergence, ∂Ei/∂wi = 0, and wi will be the "median" of Si (our definition). We have also assumed that the volume Vol(Si) ≠ 0; hence, the integral term in equation A.6 will always be positive.
Taking the derivative of equation A.6 yields

$$\frac{\partial E_i}{\partial w_i} = 0 = \left[ \frac{\partial}{\partial w_i} \frac{p(w_i)}{N\, \lambda(w_i)} \right] \int_{S_i} |v - w_i|\, dv + \frac{p(w_i)}{N\, \lambda(w_i)}\, \frac{\partial}{\partial w_i} \int_{S_i} |v - w_i|\, dv. \qquad (A.7)$$
We know that the derivative of the integral equals zero, since wi is the "median" of Si, and that the integral is positive and nonzero. Hence, the derivative of the ratio p(wi)/λ(wi) must be equal to zero; in other words, the ratio is a constant. In conclusion, we have that λ(wi) ∝ p(v). Furthermore, since each quadrilateral is common to $2^d$ adjacent neurons, λ will be a smooth function.

Corollary 2. In the d-dimensional case, we have an equiprobable quantization for large N, given λ a smooth function.

Proof. Since for large N and λ smooth, the ratio p/λ reduces to a constant independent of the lattice index i, we have that each Si contributes equally in terms of activation. Hence, we have an equiprobable quantization.

Proposition 6.
In the d-dimensional case, a tangled lattice is not a stable solution.
Proof by indirect demonstration. Assume the inverse: a lattice with at least one topological defect is a stable solution. Without loss of generality, we can consider a two-dimensional lattice. In case of a topological defect, such as at neuron i, at least two quadrilaterals will overlap. Assume that (part of) the overlap is located in the upper-left quadrant $Q_{UL}$ and that an input falls in it: hence, quadrant $Q_{UL}$ is activated once. However, due to the twofold overlap, neuron i will be updated twice. (Note that a single weight update is the same for each quadrant.) As a result, the update probabilities do not match the activation probabilities of the assumed equiprobable quadrants, and the average learning rule cannot be stable. Hence, the locally tangled lattice cannot be a stable configuration.

Appendix B: Fast Version of One-Dimensional MER

In the one-dimensional case, MER becomes

$$\Delta w_i = \eta\, (\mathbb{1}_{H_{i+1}} - \mathbb{1}_{H_i}), \qquad i = 1, \ldots, N. \qquad (B.1)$$

The average squared change of weights per iteration (intensity of variation) at convergence is $2\eta^2/(N+1)$ for $i = 2, \ldots, N-1$, and $3\eta^2/(N+1)$ for $i = 1$ or $N$.
The rate of convergence for a p.d.f. with bounded support and for which the weights are initialized outside the p.d.f.'s range is on the order of O(η/N²). A faster version of MER is found by updating all weights at each time step:

$$\Delta w_i = \eta \left( \sum_{k=i+1}^{N+1} \frac{\mathbb{1}_{H_k}}{N+1-i} - \sum_{k=1}^{i} \frac{\mathbb{1}_{H_k}}{i} \right), \qquad i = 1, \ldots, N. \qquad (B.2)$$
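A sketch of one step of this faster rule. It assumes the sign convention of equation B.1 (activity in intervals to the right of a weight pushes it right) and, for simplicity, uses indicator values of 1 for the single active interval, omitting the boundary-interval weighting of equation 2.1.

```python
import numpy as np

def fast_mer_step(w, v, eta=1e-4):
    # Equation B.2: every weight moves at each step, pulled toward the side
    # of the active interval, with step size inversely proportional to the
    # number of intervals on that side. 0-based neuron i delimits intervals
    # H_i and H_{i+1}, so it has i + 1 intervals on its left and N - i on
    # its right.
    N = len(w)
    k = np.searchsorted(w, v)          # active interval H_k, k = 0..N
    for i in range(N):
        if k >= i + 1:                 # input to the right of w_i
            w[i] += eta / (N - i)
        else:                          # input to the left of w_i
            w[i] -= eta / (i + 1)
    return w
```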
By considering the homogeneous system of averaged equations B.2 at convergence, it can easily be shown that the faster version of MER also yields a set of equiprobable neurons. The intensity of variation is now $\eta^2/((N+1-i)\,i)$ for $i = 2, \ldots, N-1$, and $\eta^2 (2N+1)/((N+1)N)$ for $i = 1$ or $N$. The rate of convergence is on the order of O(η/N). Hence, for the same N, the faster version of MER converges N times faster and generates an N times smaller intensity of variation at convergence. This is the rule we have used in the simulations.

Appendix C: Deconvolution

The samples taken from pn and psn are quantized using MER. At the midpoints of the resulting intervals, the density estimate is located, and the shape of the underlying density curve is approximated using cubic splines. This approximation is then resampled into 8192 equidistant points in the range [−4, 4]. Deconvolution is performed via the complex cepstrum (Oppenheim & Schafer 1975). The process of deconvolution is quite sensitive to the accuracy with which the shapes of pn and psn are obtained. Therefore, we additionally use optimal filtering: we apply a bandpass filter to psn of which the (integer) low and high cutoff frequencies are chosen in such a way that the convolution of the estimated shape of ps with the shape of pn is as close as possible to the shape of psn (in the MSE sense). The adaptation of the cutoff frequencies is nonrecursive: it simply yields a table from which the best combination is chosen. Once the best estimated shape of ps is obtained, the equiprobable intervals are determined numerically by integration (extended trapezoidal rule). The correctness of these intervals is then verified using the true shape of ps: the difference from an equiprobable quantization is expressed in terms of an MSE value and plotted in Figure 2B.

Acknowledgments

I thank the anonymous reviewers for their valuable comments. I am a research associate of the National Fund for Scientific Research (Belgium) and am supported by research grants received from the National Fund (G.0185.96) and the European Commission (ECVnet EP8212).
References

Adali, T., Liu, X., Li, N., & Sönmez, M. K. (1995). A maximum partial likelihood framework for channel equalization by distribution learning. In Proc. IEEE NNSP95 (pp. 541–550).
Atick, J. J., & Redlich, A. N. (1991). Predicting ganglion and simple cell receptive field organizations. International Journal of Neural Systems, 1, 305–315.
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In International Joint Conference on Neural Networks (Washington, 1989), 1, 401–405.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702.
Martinez, D., & Yang, W. (1995). A robust backward adaptive quantizer. In Proc. IEEE NNSP95 (Cambridge, MA, 1995) (pp. 531–540).
Oppenheim, A. V., & Schafer, R. W. (1975). Digital signal processing. Englewood Cliffs, NJ: Prentice Hall.
Pajunen, P., Hyvärinen, A., & Karhunen, J. (1996). Nonlinear blind source separation by self-organizing maps. In S.-I. Amari, L. Xu, L.-W. Chun, I. King, & K.-S. Leung (Eds.), Progress in neural information processing (pp. 1207–1210). Proceedings of the International Conference on Neural Information Processing.
Rabiner, L., & Juang, B. H. (1993). Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice Hall.
Rigoll, G. (1994). Maximum mutual information neural networks for hybrid connectionist-HMM speech recognition systems. IEEE Transactions on Speech and Audio Processing, 2, 175–184.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
Rubner, J., & Schulten, K. (1990). Development of feature detectors for self-organization. Biological Cybernetics, 62, 193–199.
Sanger, T. (1989). An optimality principle for unsupervised learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 11–19). San Mateo, CA: Morgan Kaufmann.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
Xie, F., & Van Compernolle, D. (1994). A family of MLP based nonlinear spectral estimators for noise reduction. In Proc. ICASSP'94, 2 (pp. 53–56).
Received August 23, 1995; accepted May 6, 1996.
Communicated by Alexandre Pouget
Self-Organization of Firing Activities in Monkey's Motor Cortex: Trajectory Computation from Spike Signals

Siming Lin
Jennie Si
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287 USA

A. B. Schwartz
Neurosciences Institute, San Diego, CA 92121 USA
The population vector method has been developed to combine the simultaneous direction-related activities of a population of motor cortical neurons to predict the trajectory of the arm movement. In this article, we consider a self-organizing model of a neural representation of the arm trajectory based on neuronal discharge rates. A self-organizing feature map (SOFM) is used to select the optimal set of weights in the model to determine the contribution of an individual neuron to an overall movement representation. The correspondence between movement directions and discharge patterns of the motor cortical neurons is established in the output map. The topology-preserving property of the SOFM is used to analyze the recorded data of a behaving monkey. The data used in this analysis were taken while the monkey was tracing spirals and doing center→out movements. The arm trajectory could be well predicted using such a statistical model based on the motor cortex neuronal firing information. The SOFM method is compared with the population vector method, which extracts information related to trajectory by assuming that each cell has a fixed preferred direction during the task. This implies that these cells are acting along lines labeled only for direction. However, extradirectional information is carried in these cell responses. The SOFM has the capability of extracting not only direction-related information but also other parameters that are consistently represented in the activity of the recorded population of cells.
1 Introduction The population vector method was developed by Georgopoulos and his collaborators to investigate the relations between the neuronal activity in the motor cortex of the monkey and the direction of arm movements in twoand three-dimensional spaces (Georgopoulos et al. 1982, 1983, 1984, 1988). Neural Computation 9, 607–621 (1997)
© 1997 Massachusetts Institute of Technology
In the study of reaching movement in a two-dimensional space, monkeys were trained to make point-to-point movements from a center start position to one of eight targets spaced equally around a circle (Schwartz 1992). Neurons recorded during this task were found to have mean discharge rates that were highest in a single "preferred" direction and tapered off gradually in directions farther away from the preferred direction, firing at their lowest rates for movement directions 180 degrees from the preferred direction. The relation between movement directions and discharge rates for individual neurons could be fit with a cosine tuning function (Georgopoulos et al. 1982),

D = b_0 + k \cos(θ − θ_0),  (1.1)
where b_0 is a regression coefficient corresponding to the mean activity in the task, k is a regression coefficient corresponding to the depth of the modulation across different movement directions, θ is the direction of the movement, θ_0 is the preferred direction, and D is the discharge rate. The tuning function of individual motor cortical neurons spans the entire directional domain. Each directional neuron encodes all directions of movement. Conversely, all neurons in the cortical population simultaneously encode a single movement (Schwartz 1993). If C_i is the unit preferred direction vector for the ith neuron, then the neuronal population vector P(t) is determined as the weighted sum of these vectors,

P(t) = \sum_i r_i(t) C_i,  (1.2)
where the weight r_i(t) is the activity (discharge rate minus the geometric mean) of the ith neuron at time bin t. The discharge rate can be measured during the monkey's arm movements, resulting in a series of weighted vectors for each neuron. When these vectors are summed across the population for each bin, the resulting series of population vectors P(t) represents well the direction of the movement as it changes during the task. Moreover, the magnitudes of these population vectors suggest the instantaneous displacement (speed) throughout the task. Thus, connecting these population vectors tip-to-tail, one may obtain a neural trajectory. It was shown that real trajectories of arm movement could be predicted by neural trajectories (Georgopoulos et al. 1988; Schwartz 1993, 1994). To use the population vector approach described, one needs to know the preferred direction of every neuron in the cortical population. Usually this is determined experimentally. Any bias in the preferred direction vectors could result in prediction error in the trajectory. Some other methods have been used for population decoding. The simulated annealing algorithm has been used to adjust the connection strengths of a feedback neural network so that it would generate a given trajectory
by a sequence of population vectors (Lukashin and Georgopoulos 1994). As in the population vector algorithm, the preferred direction is required in this algorithm. The optimal linear estimator (OLE) is a statistical method for population decoding (Salinas & Abbott 1994). The estimation formula of OLE is

V_{est}(t) = \sum_i r_i(t) D_i,  (1.3)

where V_est is the estimate of the movement direction and r_i is the measured firing rate (or normalized firing rate) of neuron i. Note that D_i is different from the preferred direction. Specifically,

D_i = \sum_j Q^{-1}_{ij} L_j,  (1.4)

with

L_j = \int V f_j(V) \, dV,  (1.5)

and Q is the correlation matrix of firing rates determined as

Q_{ij} = \int r_i r_j P(r | V) \, dr \, dV,  (1.6)

where V denotes the actual movement direction, P(r | V) denotes the probability of obtaining the firing response r given that the movement direction takes the value V, and f_i(V) is the average discharge rate of cell i when the movement direction takes the value V. Equivalently,

f_i(V) = \int r_i P(r | V) \, dr.  (1.7)

It is shown that the cosine tuning curve is the optimal case for linear decoding methods, and the OLE method can produce the smallest average error of the linear methods (Salinas & Abbott 1994). However, the OLE method may encounter difficulty when the number of cells used is large and when Q is ill conditioned. When noise is present, the recorded discharge rates are no longer perfect cosine tuning curves, and the assumption made for guaranteeing the performance of OLE is no longer true. One of the objectives of this study is to provide a statistical model that extracts directional patterns encoded in the discharge rates with the following advantages: (1) it does not require the preferred directions, (2) it is robust against noise, and (3) it is easy to implement in hardware. The population vector method suggests a close association between the discharge rates of cells in the motor cortex and the direction of arm movement.
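For concreteness, the population-vector computation of equation 1.2 and the tip-to-tail trajectory construction translate into a few lines of Python/NumPy. The function name and array layout below are illustrative assumptions, not part of the original study:

```python
import numpy as np

def population_vector_trajectory(rates, preferred_dirs, dt=1.0):
    # rates: (T, n) activities r_i(t), i.e., discharge rate minus each
    # cell's mean; preferred_dirs: (n, 2) unit vectors C_i.
    # P(t) = sum_i r_i(t) C_i   (equation 1.2)
    P = rates @ preferred_dirs                 # (T, 2) population vectors
    # Connecting the P(t) series tip-to-tail (a cumulative sum) gives
    # the neural trajectory discussed in the text.
    trajectory = np.cumsum(P * dt, axis=0)
    return P, trajectory
```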
In this article, we consider a self-organizing model of the neuronal discharge patterns based on neuronal discharge rates. The inputs to the model are averaged discharge rates from a group of cells in the monkey’s motor cortex. These discharge rates were taken while the animal was tracing spirals or doing center→out reaching. The output map of the model is a two-dimensional plane. One dimension of the polar coordinate codes the predicted direction of arm movement; the other contains nonspecific movement-associated parameters such as speed and position. The nodes in the output map are interconnected by a neighborhood function during training. SOFM was used to adapt the weights of the network to generate a mapping between discharge rates of motor cortical neurons and directions of arm movement. The main advantage of this approach is that preferred directions are not prerequisites for obtaining predicted directions. Consequently the possibility of error accumulation from nonuniformities in preferred directions is reduced. Moreover, the topology-preserving property of the SOFM enables us to calibrate the output map with the topology relation within input discharge rates, which is useful when interpreting the computation results. Here we focus on the statistical discrimination of behavioral modes while the monkey was tracing spirals. The computer simulations reveal that the behavioral modes are clearly differentiated, and the map simultaneously provides a clear topological relationship among these modes. The movement directions during the spiral tracing task are well predicted by the discharge rate patterns using the self-organizing model. The population vector method has already been used to show that directions and speed are well represented in the activity of these cells (Schwartz 1993, 1994). This method assumes the linear functional relationship between discharge rates and movement directions as shown in equation 1.2. It may not capture all the consistent relationships between cell activity and different parameters that could be represented in this activity. In contrast, the statistical nature of the SOFM technique will capture these relationships without a priori assumptions about what is represented or the form of the code.
2 Spike Signal and Feature Extraction

A rhesus monkey was trained to trace with its index finger on a touch-sensitive computer monitor. In the spiral tracing task, the monkey was trained to make a natural, smooth drawing movement within approximately a 1 cm band around the presented figure. As the figure appeared, the target circle of about 1 cm radius moved in small increments along the spiral figure, and the monkey moved its finger along the screen surface to the target. The spirals we study in this article are outside→in spirals. (A more detailed description of this process can be found in Schwartz 1992, 1994.)
Figure 1: Calculation of d_ij: the number of spike intervals overlapping with the jth time interval is determined to be 3. As shown, two-thirds of the first interval, all of the second interval, and one-fourth of the third interval lie in the jth time window. Therefore, d_ij = (2/3 + 1 + 1/4)/dt.
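The overlap rule illustrated in this caption translates directly into code. A minimal sketch (the helper name is hypothetical; Python/NumPy) computes d_ij for one cell and one window, and returns (2/3 + 1 + 1/4)/dt for the three intervals of Figure 1:

```python
import numpy as np

def discharge_rate(spike_times, t_j, dt):
    # d_ij for one cell and one window (t_j, t_j + dt): each interval
    # between neighboring spikes contributes to D_ij the fraction of
    # that interval lying inside the window, and d_ij = D_ij / dt.
    spikes = np.sort(np.asarray(spike_times, dtype=float))
    D = 0.0
    for lo, hi in zip(spikes[:-1], spikes[1:]):
        overlap = min(hi, t_j + dt) - max(lo, t_j)
        if overlap > 0:
            D += overlap / (hi - lo)   # 1 for a fully contained interval
    return D / dt
```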
While the well-trained monkey was performing the previously learned routine, the occurrence of each spike signal in an isolated cell in the motor cortex was recorded. The result of these recordings is a spike train at the following typical time instances: τ_1, τ_2, . . . , τ_L. This spike train can be ideally modeled by a series of δ functions at the appropriate times:

S_c(t) = \sum_{l=1}^{L} δ(t − τ_l).  (2.1)
Let Ω be the collection of the monkey's motor cortex cells whose activities contribute to the monkey's arm movement. Realistically, we can record only a set of sample cells S = {S_i ∈ Ω, i = 1, 2, . . . , n} to analyze the relation between discharge rate patterns and the monkey's arm movements. We assume that the cells included in the set S constitute a good representation of the entire population. The firing activity of each cell S_i defines a point process f_{S_i}(t). Let d_ij denote the average discharge rate for the ith cell in the jth interval (t_j, t_j + dt). Taking d_ij of all the cells {S_i ∈ Ω, i = 1, 2, . . . , n} in (t_j, t_j + dt), we obtain a vector X_j = [d_1j, d_2j, . . . , d_nj]^T, which will later be used as the feature input vector relating to the monkey's arm movement during the time interval (t_j, t_j + dt). The average discharge rate d_ij = D_ij/dt is calculated from the raw spike train as follows. For the jth time interval (t_j, t_j + dt), one first has to determine all the intervals between any two neighboring spikes that overlap with the jth time interval. Each of the spike intervals overlapping completely with (t_j, t_j + dt) contributes 1 to D_ij. Each of those overlapping only partially with (t_j, t_j + dt) contributes the fraction of its overlap to D_ij. An illustrative example is shown in Figure 1. Due to limitations in experimental realization, the firing activities of the n cells in S could not be recorded simultaneously. Consequently, the average
discharge rate vector X_j could not be calculated simultaneously at time t_j. However, special care was taken to ensure that the monkey repeated the same experiment n times to obtain the firing signal for all the n cells under almost the same experimental conditions. We thus assume that the experiments recording every single cell's spike signal were independent, identical events, so that we could use spike signals of the n cells as if they were recorded simultaneously. In our simulations, spike signals from 81 cells in the motor cortex were used. The average discharge rates were calculated every 20 ms. The entire time span of each trial may differ a little from one trial to another. Cubic interpolations were used to normalize the recorded data (average discharge rates and movement directions in degrees) to 100 time bins. At a given bin, normalized directions (in degrees) were averaged across the 81 recordings, giving rise to a finger trajectory sequence: {∠_1, ∠_2, . . . , ∠_100}. Correspondingly, at every time bin j, we have a normalized discharge rate vector X_j = [d_1j, d_2j, . . . , d_81j]^T, j = 1, 2, . . . , 100. Discharge rate vectors and moving directions were used to train and label the SOFM.

3 The Computation Model

The SOFM has been applied successfully to speech processing (Kohonen 1988), robotics (Ritter & Schulten 1986), vector quantization (Nasrabadi & Feng 1988; Yair & Zeger 1992), and biological modeling (Obermayer et al. 1990). The SOFM learns a mapping from the input data space onto a two-dimensional array of output nodes, where each node p carries a weight vector W_p. At each step, the weights are updated by

W_p(t + 1) = W_p(t) + α(t)[X(t) − W_p(t)],  ∀p ∈ N_c(t),  (3.1)
W_p(t + 1) = W_p(t),  ∀p ∉ N_c(t),  (3.2)

where t = 0, 1, 2, . . . is the discrete-time coordinate, α(t) is the learning rate factor, and N_c(t) defines a set of neighborhood nodes of the winner node c. The winner node c is defined as the node whose weight vector has the smallest Euclidean distance from the input X(t):

\|W_c(t) − X(t)\| ≤ \|W_p(t) − X(t)\|  ∀p.  (3.3)
In our current application, we have used a 20×20 two-dimensional lattice as the output layer and α(t) = 0.95(1 − t/N) with N = 200,000. We started with the initial radius of N_c(t) equal to the diameter of the output layer and let it shrink with time.
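A minimal training loop for this setup, implementing equations 3.1 through 3.3, might look as follows. The linear shrink schedule for the neighborhood radius is our assumption; the text states only that the radius starts at the diameter of the output layer and shrinks with time:

```python
import numpy as np

def train_sofm(X, grid=20, N=200_000, seed=0):
    # X: (NP, n) discharge rate vectors; 20x20 lattice and
    # alpha(t) = 0.95 * (1 - t/N) follow the text.
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(size=(grid, grid, n))         # node weight vectors W_p
    coords = np.stack(np.meshgrid(np.arange(grid), np.arange(grid),
                                  indexing="ij"), axis=-1)
    for t in range(N):
        x = X[rng.integers(len(X))]               # random training input X(t)
        # Winner c: smallest Euclidean distance to x (equation 3.3).
        d = np.linalg.norm(W - x, axis=-1)
        c = np.unravel_index(np.argmin(d), d.shape)
        # Neighborhood N_c(t): radius starts at the lattice diameter
        # and shrinks with time (linear schedule assumed here).
        radius = max(1.0, grid * (1.0 - t / N))
        in_nbhd = np.linalg.norm(coords - np.array(c), axis=-1) <= radius
        # Update nodes inside N_c(t), leave the rest (equations 3.1-3.2).
        W[in_nbhd] += 0.95 * (1.0 - t / N) * (x - W[in_nbhd])
    return W
```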
In our application, the SOFM acts as a decoder. By finding the winners, the SOFM maps each input onto a node in the output layer. Every node in the output layer decodes an association with inputs. The critical feature of this algorithm is topology preservation. We call a map topology preserving if (1) similar input vectors are mapped onto identical or neighboring output nodes, and (2) neighboring nodes have similar weight vectors. Property 1 ensures that small changes in the input vector cause correspondingly small changes in the location of the output winner. This makes the SOFM robust against distortions of the inputs. Property 2 ensures robustness of the inverse mapping, which is critical for the accuracy of the SOFM as a decoder.

4 Trajectory Computation from Motor Cortical Discharge Rates

The major objective of this study is to investigate the relationship between a monkey's arm movements and firing patterns in the motor cortex. The topology-preserving property of the SOFM is used to analyze the recorded data of a behaving monkey. To train the network, we select n cells (in this study, n = 81) in the motor cortex. The average discharge rates of these n cells constitute a discharge rate vector X_j = [d_1j, d_2j, . . . , d_nj]^T, where d_ij, i = 1, . . . , n, is the average discharge rate of the ith cell at time bin t_j. A set of vectors {X_k | k = 1, 2, . . . , NP} recorded from experiments forms the training patterns of the network, where NP is the number of training vectors. During training, a vector X ∈ {X_k | k = 1, 2, . . . , NP} is selected from the data set at random. The discharge rate vectors are not labeled or segmented in any way in the training phase. All the features present in the original discharge rate patterns contribute to the self-organization of the map. Once the training is done, the network has learned the classification and topology relations in the input data space. Such a "learned" network is then calibrated using discharge rate vectors whose classifications are known. For instance, if a node i in the output layer wins most often for discharge rate patterns from a certain movement direction, then the node is labeled with this direction. In this part of the simulation, we used the numbers 1–16 to label the nodes in the output layer of the network, representing 16 quantized directions equally distributed around a circle. For instance, a node with the label 4 codes the movement direction of 90 degrees. If a node in the output layer of the network never wins in the training phase, we label the node with a −1. In the testing phase, a −1 node can become a "winner" just like those marked with positive numbers; the movement direction is then read out from its nearest node with a positive direction number. Figure 2 gives the learning results using discharge rates recorded from the 81 cells in the motor cortex in one trial of the spiral task. From the output map, we can clearly identify three circle-shaped patterns representing the spiral trajectory. These results show that the monkey's arm movement directions are clearly encoded in firing patterns of the motor cortex. By the
Figure 2: The self-organized map for the discharge rates of the 81 neurons in the motor cortex in the spiral task. The size of the output map of the SOFM is 20×20. The nodes marked with positive numbers are cluster centers, representing the directions as labeled, and the nodes labeled −1 are the neighbors of the centers. The closer a node is to a center, the more likely it is to represent the coded direction of that center.
property of topology preservation, we can discern a clear pattern from the locations of the nodes in the map. Note that nodes with the same numbers (i.e., directions) on different spiral cycles are adjacent. The same holds for the actual trajectory, and this is what we mean by topology preservation, an attribute of the SOFM. Different cycles of the spiral are located in correspondingly different locations of the feature map, suggesting that parameters in addition to direction influence cortical discharge rates.

4.1 Using Data from Spiral Tasks to Train the SOFM. In this experiment, we used data recorded from five trials in spiral tracing tasks. The first trial was used for testing and the rest for training. After the training and labeling procedure, every node in the output map of the SOFM was associated with one specific direction. The direction of a winner was called the "neural direction" when discharge rate vectors were presented as the input of the
Figure 3: The neural directions and finger directions in 100 bin series in the inward spiral tracing task.
network bin by bin in the testing phase. Figure 3 gives neural directions against the corresponding monkey’s finger movement directions. After postfiltering the predicted neural directions with a moving average filter of 3-bin width, we combined the neural directions tip-to-tail bin by bin. The resulting trace (neural trajectory) is similar to the finger trajectory. Note that the neural directions are unit vectors. To use vectors as a basis for trajectory construction, the vector magnitude must correspond to movement speed. During drawing, there is a “two-thirds power law” (Schwartz 1994) relation between curvature and speed. Therefore, the speed remains almost constant during the spiral drawing. This is the reason we can use unit vectors in this construction and still get a shape that resembles the drawn figure. Figure 4 demonstrates the complete continuous monkey’s finger movement and the predicted neural trajectory. 4.2 Using Data from Spiral and Center→Out Tasks to Train the SOFM. In addition to the spiral data, the data recorded in the center→out tasks were added to the training data set. Figure 5 shows that the overall performance
Figure 4: Using data from spiral task for training. Left: The monkey’s finger movement trajectory calculated with constant speed. Right: The SOFM neural trajectory.
Figure 5: Neural trajectory: Using data from spiral and center→out tasks for training.
of the SOFM has been improved. This result suggests that the coding of arm movement direction is consistent across the two different tasks.

4.3 Average Testing Result Using the Leave-k-Out Method. The data recorded from five trials in the spiral tasks and the center→out tasks were used in this experiment. Each time, we chose one of the trials in the spiral tasks for testing and used the rest for training. We thus had five different sets of experiments for training and testing the SOFM. Note that the training and testing data were disjoint in each set of experiments. The neural trajectories from the
Figure 6: Neural trajectory: Average testing result using leave-k-out method.
five tests were averaged, and Figure 6 shows the averaged result. Due to the nonlinear nature of the SOFM network, a different training data set may result in a different outcome in the map. Therefore, the leave-k-out method provides a realistic view of the overall performance of the SOFM. Figure 7 shows the bin-by-bin difference between the neural directions and the finger directions in 100 bins. The dashed line in the figure represents the average error over the 100 bins.

4.4 Trajectory Computation by the Population Vector Algorithm. Figures 8 and 9 give the computation results obtained by using the population vector method on the same data as in the leave-k-out test. The average error in the first circle of the neural trajectory is about 10 degrees. This is comparable to the average error of 8.65 degrees from the SOFM algorithm; however, the error grows over time, as shown in Figure 9.

5 Discussion

Our results based on raw data recorded from a monkey's motor cortex have revealed that the monkey's arm trajectory is encoded in the averaged discharge rates of the motor cortex. Note from Figure 2 that when the monkey's finger was moving along the same direction on different cycles of the spiral, one observes distinct cycles on the output map as well. By the topology-preserving property of the SOFM, this suggests that the output nodes decode not only directions but probably also speed, curvature, and position. Although the population vector method has previously shown this information to be present in the motor cortex, the SOFM model captures regularities
Figure 7: The binwise difference between the finger directions and the neural directions in 100 bins. The dashed line is the average error in 100 bins. The neural trajectory is the average testing result of the SOFM using the leave-k-out method.
Figure 8: Neural trajectory computed by using the population vector method. The same data of the 81 cells used in the leave-k-out method were used.
about speed and curvature without assuming any linear functional relationship to discharge rates, as the population vector and OLE methods do in equations 1.2 and 1.3, respectively. Although we did not explicitly examine speed in this study, it has been found that the population vector magnitude is highly correlated with speed (Schwartz 1993), showing that this
Figure 9: The binwise difference between the finger directions and the neural directions in 100 bins. The dashed line is the average error in 100 bins. The neural trajectory is computed by using the population vector method.
parameter is represented in motor cortical activity. For those drawing movements with more significant changes in speed, we may quantize the speed as we did with the directions. As discussed in section 4, direction and curvature or speed may be encoded in discharge rates as a feature unit; we can then label the nodes in the output map using directions and speed together. Consequently, the same process of direction recognition using the SOFM could be applied to speed recognition based on discharge rates of the motor cortex. Other types of networks, such as sigmoid feedforward and radial basis function networks, could have been used to associate discharge rates with movement directions. The SOFM, however, offers a unique combination of both data association (which some other networks can achieve) and topology information as represented by the two-dimensional output map (which is hard or even impossible for other networks to implement). Hebb hypothesized in 1949 that the basic information processing unit in the cortex is a cell assembly, which may include thousands of cells in a highly interconnected network. This cell assembly hypothesis shifts the focus from a single cell to the complete network activity. In our current study, information from 81 cells in the monkey's motor cortex has been used to predict the monkey's arm movement directions. Our results show that these 81 cells have characterized the monkey's motor cortical activities quite well in a spiral arm movement. Other studies, such as synfire chains (Abeles 1991) and multicell correlation (Vaadia et al. 1991), have also revealed the significance of the amount of information provided by local measurements with a limited number of cells in the associative cortex. It is quite intriguing that
a small set of cells could represent the hypothesized cell assembly quite well in certain tasks. As mentioned in section 2, spike signals of the 81 cells were recorded in 81 independent experiments. Extracellular electrophysiological measurements are now being obtained through simultaneous recording of a few randomly selected cells. We plan to use the SOFM to extract movement information efficiently in real time from small groups of simultaneously recorded units.

Acknowledgments

This research was supported in part by the Whitaker Foundation. The first two authors' research is also supported in part by NSF under grant ECS-9552302. Work carried out by the third author while at Barrow Neurological Institute was supported by NIH R01 NS-26375.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). The relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2(11), 1527–1537.
Georgopoulos, A. P., Caminiti, R., Kalaska, J. F., Crutcher, M. D., & Massey, J. T. (1983). Spatial coding of movement: A hypothesis concerning the coding of movement direction by motor cortical populations. Exp. Brain Res. Suppl., 7, 327–336.
Georgopoulos, A. P., Kalaska, J. F., Crutcher, M. D., Caminiti, R., & Massey, J. T. (1984). The representation of movement direction in the motor cortex: Single cell and population studies. In G. M. Edelman, W. E. Goll, & W. M. Cowan (Eds.), Dynamic aspects of neocortical function (pp. 501–524). New York: Wiley.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in 3-D space. II. Coding of the direction of movement by a neuronal population. J. Neurosci., 8(8), 2928–2937.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Kohonen, T. (1988). The "neural" phonetic typewriter. Computer, 21(3), 11–22.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78, 1464–1480.
Lukashin, A. V., & Georgopoulos, A. P. (1994). A neural network for coding of trajectories by time series of neuronal population vectors. Neural Computation, 6, 19–28.
Nasrabadi, N. M., & Feng, Y. (1988). Vector quantization of images based upon the Kohonen self-organizing feature maps. IEEE International Conference on Neural Networks, San Diego, pp. 1101–1108.
Obermayer, K. H., Ritter, H., & Schulten, K. (1990). Large-scale simulations of self-organizing neural networks on parallel computers: Application to biological modeling. Parallel Computing, 14, 381–404.
Ritter, H., & Schulten, K. (1986). Topology conserving mappings for learning motor tasks. AIP Conference Proceedings, 151, 376–380.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rate. Journal of Computational Neuroscience, 1, 89–108.
Schwartz, A. B. (1992). Motor cortical activity during drawing movements: Single-unit activity during sinusoid tracing. J. Neurophysiology, 68, 528–541.
Schwartz, A. B. (1993). Motor cortical activity during drawing movements: Population representation during sinusoid tracing. J. Neurophysiology, 70, 28–36.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–542.
Vaadia, E., Ahissa, E., Bergman, H., & Lavner, Y. (1991). Correlated activity of neurons: A neural code for higher brain function. In J. Krüger (Ed.), Neural cooperativity (pp. 249–279). Berlin: Springer-Verlag.
Yair, E., & Zeger, K. G. (1992). Competitive learning and soft competition for vector quantizer design. IEEE Trans. on Signal Processing, 40, 294–309.
Received September 20, 1995; accepted May 14, 1996.
Communicated by David MacKay
Hyperparameter Selection for Self-Organizing Maps Akio Utsugi National Institute of Bioscience and Human Technology, Higashi, Tsukuba, Ibaraki 305, Japan
The self-organizing map (SOM) algorithm for finite data is derived as an approximate maximum a posteriori estimation algorithm for a gaussian mixture model with a gaussian smoothing prior, which is equivalent to a generalized deformable model (GDM). For this model, objective criteria for selecting hyperparameters are obtained on the basis of empirical Bayesian estimation and cross-validation, which are representative model selection methods. The properties of these criteria are compared by simulation experiments. These experiments show that the cross-validation methods favor more complex structures than is supported by the expected log likelihood, a measure of compatibility between a model and a data distribution. On the other hand, the empirical Bayesian methods have the opposite bias.

1 Introduction

Several standard learning methods for neural networks are being recast as estimation algorithms for stochastic models. Such statistical treatment of learning enables inference at a higher level than simple parameter estimation, such as the evaluation of model reliability and automatic model selection. For example, MacKay (1992) studied backpropagation learning in a unifying Bayesian framework and presented a selection method for hyperparameters and model structure. Among many learning methods, the self-organizing map (SOM) (Kohonen 1988, 1990) is unique from the viewpoint of data analysis because of its ability to extract topological structure hidden in data. However, since this learning was originally defined only at an algorithmic level, its development toward higher-level inference is difficult. Statistical models behaving in a manner similar to SOM have also been studied. The elastic net (Durbin et al. 1989), which is one of the generalized deformable models (GDM) (Yuille 1990), learns a topology-preserving map as maximum a posteriori (MAP) estimates of the parameters of a Bayesian stochastic model. Although this model was studied originally in the context of optimization problems such as the traveling salesman problem, it can be used for smoothing data along a specified topology. However, this requires the determination of two hyperparameters: the size of the noise and Neural Computation 9, 623–635 (1997)
© 1997 Massachusetts Institute of Technology
the smoothness of the route. The selection of these hyperparameters was attempted by an empirical Bayesian method similar to that of MacKay for backpropagation learning (Utsugi 1993). This article first reviews the hyperparameter selection method. Next, the SOM algorithm for finite data is derived as an approximate MAP estimation algorithm for GDM. Thus, SOM and GDM can be considered equivalent at the stochastic model level. Finally, cross-validation, another representative model selection method, is applied to hyperparameter selection for our model and compared with the empirical Bayesian method by simulation experiments.

2 Empirical Bayesian Hyperparameter Selection for GDM

2.1 Construction of GDM as Stochastic Model. Initially, we regard the simplest type of gaussian mixture model as a stochastic model for competitive learning (Nowlan 1990). For a data set X consisting of m-dimensional data points x_i = (x_{i1}, . . . , x_{im})′ (i = 1, . . . , n), the likelihood function of the model with r components is given by

f(X | w, β) = \prod_{i=1}^{n} \sum_{s=1}^{r} \frac{1}{r} f(x_i | w_s, β),  (2.1)

where

f(x_i | w_s, β) = \left( \frac{β}{2π} \right)^{m/2} \exp\left( -\frac{β}{2} \|x_i - w_s\|^2 \right)  (2.2)
is a component gaussian density with centroid w_s = (w_{s1}, . . . , w_{sm})′ and variance 1/β, and w = (w′_1, . . . , w′_r)′. Furthermore, we consider a gaussian smoothing prior to express smooth variation of the centroids along a topological space. Using a discretized differential operator matrix D on the topological space, the density of this smoothing prior is defined by

g(w | α) = \prod_{j=1}^{m} \left( \frac{α}{2π} \right)^{l/2} (\det{}^{+} D′D)^{1/2} \exp\left( -\frac{α}{2} \|Dw^{(j)}\|^2 \right),  (2.3)
where w^{(j)} = (w_{1j}, . . . , w_{rj})′, l = rank D′D, and det⁺ D′D denotes the product of the positive eigenvalues of D′D. The hyperparameter α represents the strength of the smoothness constraint. Although the elastic net uses the first-order differential operator on a one-dimensional closed-loop topology, we can use various kinds of differential operators and topologies.
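As an illustration, the sketch below builds a second-order difference operator on a line-segment topology (the operator spelled out later, in equation 3.9) and evaluates the log of the smoothing prior of equation 2.3. The helper names are ours, and l and det⁺ D′D are computed from the positive eigenvalues of D′D:

```python
import numpy as np

def second_diff_operator(r):
    # Second-order differences on a one-dimensional line-segment
    # topology (the operator written out in equation 3.9): rows [1, -2, 1].
    D = np.zeros((r - 2, r))
    for i in range(r - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D

def log_smoothing_prior(W, D, alpha):
    # log g(w | alpha) of equation 2.3. W: (r, m) centroids, column j
    # is w^(j); det+ D'D is the product of positive eigenvalues of D'D.
    r, m = W.shape
    ev = np.linalg.eigvalsh(D.T @ D)
    pos = ev[ev > 1e-10]
    l = len(pos)                        # l = rank D'D
    return (m * (0.5 * l * np.log(alpha / (2 * np.pi))
                 + 0.5 * np.sum(np.log(pos)))
            - 0.5 * alpha * np.sum((D @ W) ** 2))
```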
From the likelihood (cf. equation 2.1) and the prior (cf. equation 2.3), we obtain the log posterior by Bayes' theorem:

\log g(w | X, α, β) = \log f(X | w, β) + \log g(w | α) + const.
= \sum_{i=1}^{n} \log \sum_{s=1}^{r} \exp\left( -\frac{β}{2} \|x_i - w_s\|^2 \right) - \frac{α}{2} \sum_{j=1}^{m} \|Dw^{(j)}\|^2 + const.  (2.4)
This corresponds to the negative energy function of an elastic net, whose maximizer gives the MAP estimates of the centroids. The elastic net algorithm is a MAP estimation algorithm using the gradient ascent method. We can also use the expectation-maximization (EM) algorithm (Yuille et al. 1994; Utsugi 1994), which is explained in section 3.

2.2 Selection of Hyperparameters by Empirical Bayesian Method. Next, we obtain the marginal likelihood of the hyperparameters α and β:

f(X | α, β) = \int f(X | w, β) g(w | α) \, dw = \int f(w, X | α, β) \, dw ∝ \int g(w | X, α, β) \, dw.  (2.5)

This is also called the evidence of the hyperparameters. Although we desire to obtain the optimal values of the hyperparameters by maximizing the evidence, we have difficulty calculating the integral in equation 2.5 exactly. Here, we use a gaussian approximation (MacKay 1992), where the logarithm of the integrand is replaced by its quadratic approximation at the maximizer. In this case, using the MAP estimate ŵ and the negative Hesse matrix of the log posterior,

H(w) = -\frac{∂^2}{∂w \, ∂w′} \log g(w | X, α, β),  (2.6)

the integrand is approximated as

f(w, X | α, β) ≃ f(ŵ, X | α, β) \exp\left( -\frac{1}{2} w′ H(ŵ) w \right).  (2.7)
Then the evidence is approximated as

f(X, S_ŵ | α, β) = \int_{S_ŵ} f(w, X | α, β) \, dw ≃ (2π)^{rm/2} (\det H(ŵ))^{-1/2} f(ŵ, X | α, β),  (2.8)
where S_ŵ is a region dominated by ŵ in the parameter space. This evidence consists of the probability mass on S_ŵ only, and thus it should be called a local evidence. We use the local evidence to select the values of the hyperparameters, in the manner of MacKay's treatment of backpropagation learning. Now, the log evidence is calculated by

\log f(X, S_ŵ | α, β) ≃ \frac{nm}{2} \log β + \sum_{i=1}^{n} \log \sum_{s=1}^{r} \exp\left( -\frac{β}{2} \|x_i - ŵ_s\|^2 \right) + \frac{lm}{2} \log α - \frac{α}{2} \sum_{j=1}^{m} \|Dŵ^{(j)}\|^2 - \frac{1}{2} \log \det H(ŵ) + const.  (2.9)
The matrix H(ŵ) is the sum of the negative Hesse matrices of the log likelihood and the log prior, denoted by H_f(ŵ) and H_g, respectively. The matrix H_f(ŵ) consists of the submatrices

H_{st}(ŵ) = -\frac{∂^2}{∂w_s \, ∂w′_t} \log f(X | ŵ, β)
= \begin{cases} β^2 \sum_{i=1}^{n} p_{si}(p_{si} - 1)(x_i - ŵ_s)(x_i - ŵ_s)′ + β n_s I_m, & s = t \\ β^2 \sum_{i=1}^{n} p_{si} p_{ti} (x_i - ŵ_s)(x_i - ŵ_t)′, & s ≠ t \end{cases}
\qquad (s, t = 1, . . . , r),  (2.10)
where I_m is an identity matrix of size m. The quantity p_si is defined by

p_{si} = \frac{f(x_i | ŵ_s, β)}{\sum_{s′=1}^{r} f(x_i | ŵ_{s′}, β)},  (2.11)
which is called the fuzzy membership of the ith data point in the sth component, and n_s = \sum_{i=1}^{n} p_{si} is the estimated number of data points belonging to the sth component. The matrix H_g is given by
H_g = α D′D ⊗ I_m,  (2.12)
where ⊗ denotes the Kronecker product. By maximizing the evidence, we obtain estimates of α and β. Simulation experiments showed that this method gives good solutions (Utsugi 1993).¹

¹ This evidence is improved over the previously presented one, for which r was used instead of l in equation 2.9. Over a wide range of α, the two are almost the same. However, the old evidence grows without bound as α → ∞, while the new one stays finite.
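Putting equations 2.9 through 2.12 together, the approximate log evidence can be sketched as follows (Python/NumPy; our variable names; the additive constant of equation 2.9 is dropped):

```python
import numpy as np

def log_evidence(X, W, D, alpha, beta):
    # X: (n, m) data; W: (r, m) MAP centroids; D: operator matrix.
    n, m = X.shape
    r = W.shape[0]
    sq = ((X[None] - W[:, None]) ** 2).sum(-1)      # (r, n) squared distances
    a = -0.5 * beta * sq
    amax = a.max(0)
    lse = amax + np.log(np.exp(a - amax).sum(0))    # log-sum-exp over s
    p = np.exp(a - lse)                             # memberships, eq. 2.11
    ns = p.sum(1)
    H = np.kron(alpha * D.T @ D, np.eye(m))         # H_g, eq. 2.12
    for s in range(r):                              # H_f blocks, eq. 2.10
        Xs = X - W[s]
        for t in range(r):
            if s == t:
                B = beta**2 * (Xs * (p[s] * (p[s] - 1))[:, None]).T @ Xs \
                    + beta * ns[s] * np.eye(m)
            else:
                B = beta**2 * (Xs * (p[s] * p[t])[:, None]).T @ (X - W[t])
            H[s*m:(s+1)*m, t*m:(t+1)*m] += B
    l = int(np.sum(np.linalg.eigvalsh(D.T @ D) > 1e-10))   # l = rank D'D
    _, logdetH = np.linalg.slogdet(H)               # H positive definite at MAP
    return (0.5 * n * m * np.log(beta) + lse.sum()
            + 0.5 * l * m * np.log(alpha)
            - 0.5 * alpha * np.sum((D @ W) ** 2)
            - 0.5 * logdetH)
```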
3 Derivation of SOM Algorithm

In this section, the original SOM algorithm for finite data is derived as an approximate MAP estimation algorithm for the GDM. Initially, we show that the EM algorithm of the GDM is an alternate iteration of smoothing and soft classification. Now we define binary membership variables Y = (y_{si}), where y_{si} denotes the membership of the ith data point in the sth component. Regarding these variables as missing data, we obtain a complete likelihood,

f(X, Y | w, β) = \prod_{i=1}^{n} \prod_{s=1}^{r} \left( \frac{1}{r} f(x_i | w_s, β) \right)^{y_{si}},  (3.1)
which would be the exact likelihood if the missing data were observed. In the EM algorithm, the following function is maximized instead of the genuine log posterior (cf. equation 2.4):

Q(w) = E_Y(\log f(X, Y | w, β) | X, ŵ, β) + \log g(w | α),  (3.2)

where ŵ is a temporary estimate and E_Y( · | · ) denotes conditional expectation with respect to Y. The maximizer of Q is used as ŵ at the next step, and this procedure is iterated until convergence. By calculating Q, we obtain

Q(w) = -\frac{β}{2} \sum_{i=1}^{n} \sum_{s=1}^{r} p_{si} \|x_i - w_s\|^2 - \frac{α}{2} \sum_{j=1}^{m} \|Dw^{(j)}\|^2 + const.,  (3.3)

where p_si is the fuzzy membership (cf. equation 2.11) using the temporary estimate ŵ. This maximization of Q is equivalent to the independent minimizations of

H_j(w^{(j)}) = \sum_{s=1}^{r} n_s (x̄_{sj} - w_{sj})^2 + γ \|Dw^{(j)}\|^2,  (j = 1, . . . , m),  (3.4)
where x̄_{sj} is the mean of the data weighted by the fuzzy membership,

x̄_{sj} = \frac{1}{n_s} \sum_{i=1}^{n} p_{si} x_{ij},  (3.5)

and γ = α/β. Each function H_j can be regarded as a discretized Laplacian smoothing criterion (O'Sullivan 1991) for the observations {x̄_{sj} : s = 1, . . . , r} with confidence weights {n_s}, which are obtained through the soft competition process (cf. equation 2.11). A smooth curve minimizing this criterion is used as the next temporary estimate and is given by

w̃^{(j)} = K N x̄^{(j)},  (3.6)
where N is a diagonal matrix consisting of {n_s}, x̄^{(j)} = (x̄_{1j}, . . . , x̄_{rj})′, and

K = (N + γ D′D)^{-1}.  (3.7)
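A single EM iteration of the GDM, soft classification followed by discretized Laplacian smoothing (equations 3.5 through 3.7), can be sketched as follows; the generalized EM damping introduced later (equation 3.10) can be applied to the returned estimate:

```python
import numpy as np

def em_smoothing_step(X, W, D, alpha, beta):
    # One EM update of the GDM: soft classification (equation 2.11)
    # followed by discretized Laplacian smoothing (equations 3.5-3.7).
    # X: (n, m) data; W: (r, m) current centroid estimates.
    sq = ((X[None] - W[:, None]) ** 2).sum(-1)          # (r, n)
    a = -0.5 * beta * sq
    p = np.exp(a - a.max(0))
    p /= p.sum(0)                                       # fuzzy memberships
    ns = p.sum(1)                                       # confidence weights
    Xbar = (p @ X) / np.maximum(ns, 1e-12)[:, None]     # x_bar, eq. 3.5
    K = np.linalg.inv(np.diag(ns) + (alpha / beta) * D.T @ D)   # eq. 3.7
    return K @ (ns[:, None] * Xbar)                     # K N x_bar, eq. 3.6
```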
From the above explanation, the EM algorithm of the GDM can be regarded as an alternate iteration of discretized Laplacian smoothing and soft classification. On the other hand, Mulier and Cherkassky (1995) interpreted the batch SOM algorithm as an alternate iteration of kernel smoothing and hard classification. Using n_t and x̄_t = (x̄_{t1}, . . . , x̄_{tm})′, which are the number and the mean of the data points in the Voronoi region of the weight point of the tth inner unit, they expressed the weight update rule of SOM as

w̃_{sj} = \sum_{t=1}^{r} κ(s, t) n_t x̄_{tj} \Big/ \sum_{t=1}^{r} κ(s, t) n_t,  (j = 1, . . . , m),  (3.8)
where κ(s, t) is a kernel function, for example, a gaussian density function with respect to s − t for the normal one-dimensional topology. This is regarded as extended Nadaraya-Watson kernel smoothing (Härdle 1990) of the observations {x̄_{tj}} with confidence weights {n_t}. On this point, the differences between GDM and SOM can be summarized as follows: (1) soft classification versus hard classification, (2) discretized Laplacian smoothing versus kernel smoothing, and (3) in the original SOM, incremental learning is used rather than batch learning. Each method used in SOM can be regarded as an approximation of the associated method used in GDM, as explained below. The soft competition process (cf. equation 2.11) turns hard as β → ∞, and thus hard classification gives a good approximation to soft classification if β is large.² The discretized Laplacian smoothing can also be approximated by kernel smoothing. As shown in the appendix, a curve smoothed by discretized Laplacian smoothing (cf. equation 3.6) is expressed in the kernel smoothing form (cf. equation 3.8) using the entries of K as the values of the kernel function κ(s, t). In reality, this kernel function varies according to the variation of N, unlike in SOM. In many cases, however, N is nearly proportional to I_r at the last stage of learning, since the expectation of N is proportional to I_r. In particular, for the second-order differential operator on a one-dimensional
² Also, hard competition can be derived from a MAP estimation algorithm of GDM with a classification likelihood rather than the mixture likelihood (cf. equation 2.1). The classification likelihood has the same form as the complete likelihood (cf. equation 3.1), though Y is regarded as a parameter rather than missing data. In general, classification likelihood approaches lead to poorer results than mixture likelihood approaches (McLachlan & Basford 1988).
line-segment topology, whose entries are

d_{ij} = \begin{cases} -2, & |i - j + 1| = 0 \\ 1, & |i - j + 1| = 1 \\ 0, & \text{otherwise}, \end{cases} \qquad (i = 1, . . . , r - 2; \; j = 1, . . . , r),  (3.9)
we can obtain an explicit form of the kernel function using Silverman's (1984) equivalent kernel of spline smoothing (Utsugi 1994). This kernel function has a Mexican-hat shape with a width proportional to γ^{1/4}. Finally, we can also use a generalized EM algorithm (Dempster et al. 1977), where we use
(0 < c ≤ 1)
(3.10)
˜ . Using small c, we can make as the next temporary estimate instead of w the variation of the parameter in one leaning step arbitrarily small, where batch and incremental leaning are similar. 4 Comparison of Hyperparameter Selection Methods 4.1 Hyperparameter Selection by Cross-Validation. In section 2, an empirical Bayesian method for hyperparameter selection was studied. Another commonly used method for such problems is cross-validation. In this section, we apply cross-validation to our problem and compare the results with those of the empirical Bayesian method through simulation experiments. Generally the rationale of cross-validation is as follows. The expected log likelihood (ELL) by a true data distribution is a good measure of model adequacy. In reality, we cannot calculate this value because of the unknown true data distribution. Instead, we use a cross-validation score as an unbiased estimate of ELL. However, this estimated criterion has considerable variance for small data, and its maximizer is not an unbiased estimate for the peak position of ELL. In the next simulation, we will calculate ELL, in addition to cross-validation scores and evidence, to observe the bias and variance of these criteria. 4.2 Simulation Experiments. We apply a GDM with 20 components and the differential operator (cf. equation 3.9) to two types of artificial data sets: low and high noise condition (see Figure 1). By using the values of hyperparameters at rectangular grid points in the hyperparameter space, we obtain the graph of each criterion. The landscapes and contour maps in Figures 2 and 3 are obtained by averaging each criterion for 20 different data sets. The histograms of peak positions for individual data sets also are shown in the figures. Figure 4 illustrates the configurations of estimated centroid parameters under several values of (α, β) in the low-noise condition. From these figures, we can conclude the following.
Figure 1: Scatter plots for samples of the data sets. Artificial data x_i = (x_{i1}, x_{i2})′, (i = 1, . . . , 100) are generated from two independent standard gaussian random series {e_{i1}} and {e_{i2}} by x_{i1} = 4(i − 1)/n − 2 + σe_{i1}, x_{i2} = sin[2π(i − 1)/n] + σe_{i2}. Two noise conditions are used: low noise (σ = 0.3) and high noise (σ = 0.5). For each condition, 20 data sets are used in the simulation.
In the case of low noise (see Figure 2), the peak positions of all criteria are close to each other, except for a few peaks of the cross-validation scores. In the area where the majority of the peaks gather, the configurations of the centroid parameters have good appearances (see Figure 4, log_10 α = 2, log_10 β = 1). Thus, we can say that both methods succeed for the most part. However, the cross-validation scores have two peaks with much lower α than the other peaks, which lead to configurations that are too complicated. This is probably due to the flatness of the averaged landscape in its low-α area and the large variance of the cross-validation scores. Because of this instability, the cross-validation method is inferior to the other method in this case, though its averaged landscape is similar to that of ELL.

Figure 2: Facing page. Landscapes and contour maps of averaged criteria and peak-position histograms for 20 data sets in the low-noise condition. For each data set, the MAP estimates of the centroids are obtained by the EM algorithm of the GDM (r = 20). The values of the hyperparameters are taken from the grid points in the hyperparameter space. For each fixed β, α is decremented in the grid, and the MAP estimates at the preceding α are used as initial values of the centroids. The log evidence is calculated by equation 2.9. Cross-validation scores are obtained in the following manner. A data set is divided randomly into 10 groups of the same size. For each group, centroids are reestimated using the data in all but that group, where the MAP estimates from all data are used as initial values. For the reestimated centroids, the log likelihood is calculated using the data in the group. A cross-validation score is given as the mean of these log likelihoods. ELL is approximated by the mean log likelihood of the MAP estimates evaluated on 1000 newly generated data.
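The cross-validation score described in this caption can be sketched as follows; `fit_gdm`, standing for the EM loop of section 3, is assumed here rather than defined:

```python
import numpy as np

def cv_score(X, D, alpha, beta, n_folds=10, seed=0):
    # 10-fold cross-validation score as described in the Figure 2
    # caption.  `fit_gdm` (an EM loop built from em_smoothing_step)
    # is assumed, not defined in this sketch.
    n, m = X.shape
    idx = np.random.default_rng(seed).permutation(n)
    scores = []
    for held in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, held)
        W = fit_gdm(X[train], D, alpha, beta)       # reestimate centroids
        r = W.shape[0]
        sq = ((X[held][None] - W[:, None]) ** 2).sum(-1)
        # log likelihood of the held-out group under equation 2.1
        a = (-0.5 * beta * sq
             + 0.5 * m * np.log(beta / (2 * np.pi)) - np.log(r))
        amax = a.max(0)
        scores.append(np.sum(amax + np.log(np.exp(a - amax).sum(0))))
    return float(np.mean(scores))
```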
Figure 3: Landscapes and contour maps for averaged criteria and peak position histograms in the high-noise condition.
Figure 4: Samples of centroid configurations for several hyperparameter values in the low-noise condition.
In the case of high noise (see Figure 3), the discrepancies among the criteria increase. While cross-validation again has a bias toward low α, the evidence comes to have the opposite bias. In this case, we have difficulty choosing between the methods. However, the property that the evidence leads to the simplest model unless sufficient data for structure determination are given may be desirable, because it agrees with a general strategy of data analysis: linear models rather than nonlinear ones should be used for very noisy data.
5 Conclusion

We derived the SOM algorithm as an approximate MAP estimation algorithm for a GDM. Several methods to evaluate the quality of the hyperparameters for this model were developed on the basis of empirical Bayesian estimation and cross-validation. These methods were compared through simulation studies. It was found that the cross-validation methods favor complex structures, while the empirical Bayesian methods favor simple ones. Because of these properties and the long calculation time for cross-validation, the empirical Bayesian methods are recommended for this model.

Appendix

In this appendix, we consider the relation between kernel smoothing and discretized Laplacian smoothing. Initially, we show
K N 1_r = 1_r,  (A.1)

where 1_r is an r-dimensional column vector of all ones. In general, D 1_r = 0 since D is a differential operator. Thus, using equation 3.7,

(K N)^{-1} 1_r = (I_r + γ N^{-1} D′D) 1_r = 1_r.  (A.2)
This means that (K N)^{-1} has 1_r as an eigenvector with eigenvalue one; thus K N also has the same property. This leads to equation A.1. From equations 3.6 and A.1, a curve smoothed by discretized Laplacian smoothing is expressed in the kernel smoothing form (cf. equation 3.8) using the entries of K as the values of the kernel function κ(s, t).

References

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., Ser. B, 39, 1–38.
Durbin, R., Szeliski, R., & Yuille, A. (1989). An analysis of the elastic net approach to the traveling salesman problem. Neural Computation, 1, 348–358.
Härdle, W. (1990). Smoothing techniques with implementation in S. Berlin: Springer-Verlag.
Kohonen, T. (1988). Self-organization and associative memory (2nd ed.). Berlin: Springer-Verlag.
Kohonen, T. (1990). The self-organizing map. Proc. IEEE, 78, 1464–1480.
MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448–472.
McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. New York: Marcel Dekker.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7, 1141–1153.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In Advances in neural information processing systems 2 (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
O'Sullivan, F. (1991). Discretized Laplacian smoothing by Fourier methods. J. Amer. Statist. Assoc., 86, 634–642.
Silverman, B. W. (1984). Spline smoothing: The equivalent variable kernel method. Ann. Statist., 12, 898–916.
Utsugi, A. (1993). A Bayesian model of topology-preserving map learning (in Japanese). Trans. IEICE D-II, J76-D-II, 1232–1239.
Utsugi, A. (1994). Lateral interaction in Bayesian self-organizing maps (in Japanese). Trans. IEICE D-II, J77-D-II, 1329–1336.
Yuille, A. L. (1990). Generalized deformable models, statistical physics, and matching problems. Neural Computation, 2, 1–24.
Yuille, A. L., Stolorz, P., & Utans, J. (1994). Statistical physics, mixture of distributions and the EM algorithm. Neural Computation, 6, 334–340.
Received November 7, 1995; accepted May 7, 1996.
Communicated by Javier Movellan
Supervised Networks That Self-Organize Class Outputs Ramesh R. Sarukkai University of Rochester, Rochester, NY 14627 USA
Supervised neural network learning algorithms have proved very successful at solving a variety of learning problems; however, they suffer from a common requirement for explicit output labels. In this article, it is shown that pattern classification can be achieved, in a multilayered, feedforward neural network, without requiring explicit output labels, by a process of supervised self-organization. The class projection is achieved by optimizing appropriate within-class uniformity and between-class discernibility criteria. The mapping function and the class labels are developed together iteratively using the derived self-organizing backpropagation algorithm. The ability of the self-organizing network to generalize on unseen data is also experimentally evaluated on real data sets and compares favorably with traditional labeled supervision of neural networks. In addition, interesting features emerge from the proposed self-organizing supervision that are absent in conventional approaches.
1 Motivation

Supervised learning in conventional multilayered neural networks can be formulated without explicitly specifying output label functions. The goal of supervised learning systems is to determine a function F(X) that "agrees" with the training examples of the form ⟨X_I, G(X_I)⟩. Usually the function F( ) takes on values from a discrete set of "classes": {C_0, . . . , C_K}. Many algorithms have been proposed for such a learning task: decision tree methods such as CART (Breiman et al. 1984) and multilayered feedforward networks (Rumelhart et al. 1987) trained by the error-backpropagation algorithm, to name just two. In practice, however, there is an intermediate mapping that is needed, namely, label assignments. Formally, distinct functions, L_{C_I}, are defined for each class C_I and are referred to as the labels. Thus, the learning system finds an approximate definition of the function F(X) such that, given training examples ⟨X_I, G(X_I)⟩, L(F(X_I)) = L(G(X_I)). The popular error-backpropagation algorithm (Rumelhart et al. 1987), for training multilayered, feedforward, artificial neural networks, has been formulated on the explicit assumption that an output labeling function L( ) is available. Neural Computation 9, 637–648 (1997)
© 1997 Massachusetts Institute of Technology
Biologically, output labels are available for certain, but not all, input-output mappings. All of the above schemes share a common deficiency in that the labeling function, L( ), has to be predetermined. Why not formulate self-organizing systems that develop their own output codes based on the supervisory signals? This article presents an approach whereby conventional multilayered feedforward networks, under supervision, self-organize their own output labels. In particular, I show that supervised learning with neural networks can be viewed as a process of dimensionality reduction with additional constraints: a within-class constraint specifying that the output labels (or projections to the output space) must be the same for inputs belonging to the same class, and a between-class constraint specifying that the output labels should be different for different classes. Thus, the role of supervision is only to suggest to the learning machine which inputs belong together and which do not.

2 Separability Enhancement by Nonlinear Mappings

The idea of viewing supervised learning as a process of class-separability-enhancing dimensionality reduction was first presented in Fisher's (1936) classic work, which was the origin of discriminant analysis. Class-enhancing projections using nonlinear mappings have been extensively researched in classical feature extraction theory. Typically, iterative techniques proceed by starting with random values and searching for "good" projections based on steepest-descent methods using criterion functions (Duda & Hart 1973; Fukunaga 1972; Young & Calvert 1974) such as monotonicity, stress, and continuity. The criterion functions are usually structure-preserving measures. The algorithms terminate when no further improvement can be obtained in the projections, and the mapping function is then usually found by a least-squares technique. In the neural network literature (Rumelhart et al. 1987; Hertz et al. 1991), supervised learning is usually implicitly tied to the output labeling function. Perhaps this is due to the ease of formulating gradient weight adaptation equations (i.e., the error is related to teaching output − actual output). Independent of my work, discriminant analysis neural networks have been studied (Kuhnel & Tavan 1991; Mao & Jain 1993); however, their formulation is based on principal component analysis (PCA) networks and cannot be generalized to multilayered feedforward networks. The learning vector quantization (LVQ) algorithms proposed by Kohonen (Hertz et al. 1991) can also be grouped with the class of discriminant analysis networks under discussion. Although supervision is used, the goal of LVQ is to find appropriate cluster centers in the input space rather than class-separability-enhancing dimensionality reduction. In this article, a different method of achieving supervised nonlinear projection is implemented with a conventional multilayered feedforward network. Unlike existing iterative approaches to nonlinear class separability
enhancement, the presented implementation of supervised projection using neural networks enables the simultaneous development of the whole nonlinear mapping function along with the class labels.

3 Supervised Projection Using Neural Networks

In this section, the learning algorithm that has been implemented is discussed briefly. For the sake of clarity, simple functions were chosen, although the basic idea can be extended to more complex and general functions. Let O_{pj} be the activation state of the jth output after the presentation of pattern p. Let Ω_k be the set of inputs belonging to class k. E_{p_1,j} refers to the error with respect to pattern p_1 for output unit j. The within-class criterion used is: ∀Ω_k, k = 1, . . . , n, ∀p_1 ∈ Ω_k, minimize

E_{p_1,j} = \frac{1}{2} \sum_{∀p_2 ∈ Ω_k} [O_{p_1 j} - O_{p_2 j}]^2.  (3.1)
Note that the within-class criterion is minimized when the output responses for patterns belonging to the same class are identical, that is, when O_{p_1 j} = O_{p_2 j}. The between-class criterion is defined to be: ∀Ω_k, k = 1, . . . , n, ∀p_1 ∈ Ω_k, minimize

E_{p_1,j} = \sum_{∀p_2 ∉ Ω_k} 1 - [O_{p_1 j}(1 - O_{p_2 j}) + O_{p_2 j}(1 - O_{p_1 j})].  (3.2)
Note that the between-class criterion is minimized when the output activations for patterns belonging to different classes are different, that is, when O_{p_1 j} ≠ O_{p_2 j}. In a previous paper (Sarukkai 1994), I derived the following learning rule:

ΔW_{ij} = \sum_{∀k, p_1 ∈ Ω_k} \frac{∂O_{p_1 j}}{∂W_{ij}} \left[ η_1 |Ω_k| (\bar{O}_{Ω_k j} - O_{p_1 j}) + η_2 |\bar{Ω}_k| (1 - 2\bar{O}_{\bar{Ω}_k j}) \right],  (3.3)
where η_1 and η_2 are learning rates.¹ \bar{O}_{\bar{Ω}_k j} is the thresholded, average output activation of unit j over all patterns not belonging to class Ω_k, and |\bar{Ω}_k| refers to the cardinality of the set \bar{Ω}_k. \bar{O}_{Ω_k j} is the thresholded, average activation of unit j for class Ω_k, and |Ω_k| refers to the cardinality of the set Ω_k. The thresholded average activation is the threshold function applied to the average of the activations of a unit over the input patterns belonging to the appropriate subset of the training data. The advantage of my formulation is that gradient weight adaptation rules can be easily implemented in a neural network backpropagation context. The derived weight adaptation rule has some intuitive explanations. In the first criterion, it can be seen that the average activation of a unit for its class acts as the teaching signal. From the second criterion, it may be seen that if the average activation of a unit for "other classes" is greater than 0.5, then the term [1 - 2\bar{O}_{\bar{Ω}_k j}] is negative (equivalent to a teaching signal of 0 for the "current" class), and vice versa. Thus, the equations express the two criteria lucidly. The cardinality terms may be viewed as weighting constants related to the training set size. In some cases, the within-class criterion and the between-class criterion are "opposing" quantities: if the class means for two classes are equal, then the effect of the between-class discernibility function is cancelled out by the within-class uniformity criterion. This problem can be alleviated by separating the training procedure into two phases, as shown in Table 1. The up arrow indicates that the value of the corresponding learning parameter is effectively higher, and the down arrow indicates that it is effectively lower during that learning phase. In the first phase, the means of the outputs for inputs belonging to the same class are computed. These class means are moved away from each other by applying the class discernibility phase of the algorithm until the thresholded class means are orthogonal to each other. In practice, we find convergence. At this point, the within-class uniformity phase begins, setting a very low value for η_2 (typically zero, although a nonzero value ensures that the orthogonality of the output class labels is not immediately lost) and a higher value for η_1. Ideally, the system is allowed to switch back and forth between the two phases until orthogonal output class means with very low intraclass distances are achieved. Thus, both the functional mapping and the output class labels are developed iteratively at the same time. Since average activation values for the patterns are required, an additional batch forward pass may be required, although the averages computed from the previous pass can be used. In practice, it is better to do pairwise class discriminative training.
1 Without loss of generality, the activations are assumed to be bounded (0,1) in this particular formulation.
Table 1: Overview of the Algorithm

Algorithm SOS: Self-organizing supervision with neural nets

Class discernibility phase: Push between-class output cluster means away from each other until the thresholded means for different classes are orthogonal. (η2 ↑, η1 ↓)

Class uniformity phase: Move within-class outputs toward the class cluster means. (η2 ↓, η1 ↑)
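To make the update concrete, the following is a minimal sketch of one batch application of the rule in equation 3.3 for a single-layer sigmoid network. The single-layer restriction and all function names are my own simplifications: the paper backpropagates the same teaching terms through a hidden layer, and its experiments fold the cardinality terms into the learning rates (see the appendix).

```python
import numpy as np

def sigmoid(a):
    # activation used in the paper's experiments: 1 / (1 + exp(-0.667 a))
    return 1.0 / (1.0 + np.exp(-0.667 * a))

def sos_batch_update(W, X, labels, eta1, eta2):
    """One batch step of the SOS rule (eq. 3.3), single-layer case.
    W: (n_out, n_in) weights; X: (n_patterns, n_in) inputs; labels: class ids."""
    O = sigmoid(X @ W.T)                       # O[p, j]: activation of output j for pattern p
    dW = np.zeros_like(W)
    for k in np.unique(labels):
        in_k = labels == k
        # thresholded class mean and thresholded complement-class mean
        mean_in = (O[in_k].mean(axis=0) > 0.5).astype(float)
        mean_out = (O[~in_k].mean(axis=0) > 0.5).astype(float)
        teach = (eta1 * in_k.sum() * (mean_in - O[in_k])           # uniformity term
                 + eta2 * (~in_k).sum() * (1.0 - 2.0 * mean_out))  # discernibility term
        dO = 0.667 * O[in_k] * (1.0 - O[in_k])                     # sigmoid derivative
        dW += (teach * dO).T @ X[in_k]
    return W + dW
```

In this sketch the class discernibility phase of Table 1 corresponds to calling the update with η1 near zero and η2 positive, and the class uniformity phase to the reverse.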
Thus, the learning algorithm performs gradient descent on the sum of the between-class discernibility and within-class uniformity criteria; however, only one criterion has a dominant effect on learning at any given time. It can be proved analytically that maximizing the interclass distance defines an MSE (mean squared error) linear discriminant function (Young & Calvert 1974). Thus, convergence can be guaranteed when the self-coding algorithm is applied to a linearly separable two-class problem using a single-layer perceptron. In other cases, once orthogonal class means are achieved and the self-organized codes stabilize, the method is equivalent to learning with fixed output labels; however, since the initial configuration of the system determines the output codes developed, it would be interesting to study whether the proposed scheme is less prone to local minima than traditional labeled supervision. The computational complexity of this batch algorithm is the same as that of traditional labeled algorithms if the required mean activations are computed from a previous epoch. For the experiments presented, orthogonality was generally achieved within a few hundred epochs, and convergence within a few thousand epochs.

4 Experiment I: Iris Data

The idea was to test the classification of the three-class iris data set using the self-organization process detailed earlier. The data set contains three classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two, but the latter are not linearly separable from each other. The input attributes are sepal length, sepal width, petal length, and petal width. The network consisted of four inputs, four hidden units, and three output units. Sixty trials were performed with random initial conditions (weight range [−3.7, 3.7]). Training proceeded until the total squared error was less than 1.0 (one trial failed to converge). In the 59 successful trials, the supervised self-organization scheme achieved an average classification rate of 98.96% (maximum, 99.33%; minimum, 98.67%).

Sammon's stress (for PCA, LDA, and NDA, results are from Mao & Jain 1993) measures how well the projection preserves the interpattern distances
Table 2: Sammon's Stress Obtained for the Iris Data Set

Sammon's stress    PCA      LDA      NDA      Self-organized
Average            0.147    0.519    1.513    0.510
Sammon's stress (Sammon 1969) is defined as

$$E = \frac{1}{\sum_{\mu=1}^{n-1} \sum_{\nu=\mu+1}^{n} d^*(\mu, \nu)} \; \sum_{\mu=1}^{n-1} \sum_{\nu=\mu+1}^{n} \frac{\left[ d^*(\mu, \nu) - d(\mu, \nu) \right]^2}{d^*(\mu, \nu)}, \qquad (4.1)$$

where d*(µ, ν) and d(µ, ν) are the (Euclidean) distances between pattern µ and pattern ν in the input space and in the projected space, respectively.
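As a quick check of equation 4.1, here is a minimal sketch of the stress computation in NumPy/SciPy; the function name is my own, and the guard against coincident input patterns is my addition:

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_in, X_out):
    """Sammon's stress (eq. 4.1) between input patterns X_in (n x d_in)
    and their projections X_out (n x d_out)."""
    d_star = pdist(X_in)           # pairwise Euclidean distances, input space
    d = pdist(X_out)               # pairwise distances, projected space
    ok = d_star > 0                # skip coincident input pairs (my addition)
    return np.sum((d_star[ok] - d[ok]) ** 2 / d_star[ok]) / d_star.sum()
```

For Table 2, X_out would be the output activations of the trained network for the iris patterns.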
It can be seen from Table 2 that good values of Sammon's stress are obtained by the nonlinear self-organized projections, suggesting that the distances between patterns in the input space are captured to some degree (comparable to an LDA network) in the self-organized output class projections.

5 Experiment II: Multispeaker Syllable Classification

In this experiment, described in more detail in Sarukkai (1994), four syllables were used: ba, pa, da, and ga. The total number of speaker-dependent speech tokens was 240; half were used for training and the other half for testing. Eighty speech tokens from two speakers not represented in the training set constituted the new-speaker test set. A total of 24 Mel-scale filter bank coefficients were computed over every 20 ms window (window offset, 10 ms), giving a total of 192 parameters, which were then linearly normalized. The network architecture consisted of 192 input units, one hidden layer of 50 units, and 6 output units. Although the data used in experiments II and III are of fixed length, the algorithm can be applied in conjunction with dynamic input-length architectures such as time-delay neural networks (TDNNs). The benchmark neural net was trained with 4-bit (one per class) and 2-bit (encoded) output codings. Euclidean distance was used for classification. Tables 3 and 4 show the results for the multispeaker syllable task on the new-speaker and trained-speaker test sets (13 trials of supervised self-organization; 15 trials of labeled supervision).
2 LDA: linear discriminant analysis; NDA: nonlinear discriminant analysis; PCA: principal component analysis.
Table 3: Multispeaker Syllable Recognition Results (in %, with Standard Deviations)

Method                            New speakers     Trained speakers
Self-organizing backprop          67.69 ± 7.11     77.50 ± 2.75
Normal backprop (2-bit output)    65.25 ± 7.34     77.11 ± 4.25
Normal backprop (4-bit output)    69.17 ± 6.14     83.17 ± 3.19
Table 4: Some Output Labels Produced by the Class Separability Scheme (Experiment II)

Token    1         2         3
ba       100100    010101    011110
pa       100110    110100    110110
da       010001    000011    000001
ga       010011    100011    100001

Note: The italics accentuate the differences in the output values.
6 Experiment III: Multimodal Data

The database for this data set, described in more detail in deSa (1994), consists of auditory and visual features recorded from many speakers repeating one of the syllables ba, va, da, ga, or wa. Multimodal data obtained from four speakers were used. The acoustic data were low-pass filtered and segmented automatically (using the time-domain wave magnitude program available in the ESPS software from Entropic Research Laboratory). To ensure that all the consonantal information was retained, 50 ms from before and after the detected utterance was included in the segment. These utterances were then encoded using a 24-channel mel code over 20 ms windows overlapped by 10 ms, giving a 216-dimensional auditory code for each utterance. The visual frames were digitized as 64 × 64 8-bit gray-level images using the Datacube MaxVideo system (see Figures 1 and 2). The segmentation obtained from the acoustic signal was used to segment the video, from six frames before the acoustically determined utterance onset to four frames after. The normal flow was
Figure 1: Example of da utterance.
Figure 2: Example of ga utterance.
Figure 3: Network architecture used for multimodal experiments.
computed using differential techniques between successive frames. Finally, using averaging techniques, five frames of motion over the 64 × 64–pixel grid were obtained. The frames were then divided into 25 equal areas (5 × 5), and the motion magnitudes within each frame were averaged within each area. Thus, a final visual feature vector of dimension 125 (5 frames × 25 areas) was obtained. Each input pattern consisted of the concatenated 216-dimensional auditory channel vector and the 125-dimensional visual parameter vector. The total number of patterns was 500. They were divided arbitrarily into 350 training samples and 150 test patterns, such that all speakers were represented in the training data. The network architecture used for this problem is shown in Figure 3. The benchmark was to train the network with two-per-class output coding. The results tabulated (see Table 5) are from 39 trials of backpropagation with 2-per-class coding and 42 trials of the self-organizing coding scheme. The
Table 5: Test Set Results on the Multimodal Data Using Different Methods (%)

VQ             LVQ2.1         M-D            BP with output coding    Self-organize
A: 68, V: 36   A: 97, V: 60   A: 92, V: 58   B: 93.98 ± 1.02          B: 91.97 ± 1.99

Note: A: auditory data only. V: visual data only. B: both modalities used.
results of the self-organizing scheme compare favorably with the labeled supervision scheme. The mean results for the same data set using Kohonen's supervised LVQ, unsupervised VQ, and the M-D algorithm are cited from deSa (1994). For the LVQ and M-D tests, 30 codebook vectors were used for the auditory data and 60 for the visual data. LVQ2.1 (which is supervised) has the best performance on the test sets, mainly because 30/60 codebook vectors seem sufficient to capture the 350 training samples well.

In fact, the self-organization process goes further than combining features from both modalities: the closeness of patterns in the input space is reflected in the output encodings that the supervised self-organization scheme develops. Figure 4 shows the statistics of the interclass Hamming distances for the five classes, as measured by the codes developed by the self-organizing network over 42 trials. The closest output coding (based on interclass Hamming distance) is (da, ga). This is interesting, since experiments (Miller 1955) indicate that the syllables da and ga are often confused perceptually. Da and ga are also visually extremely alike (see Figures 1 and 2). Note that although da, ba, and ga are all voiced stops, ba can be distinguished from the other two easily from the visual modality; in general, the place of articulation is easy to see but is the hardest feature to hear. The pair with the next lowest interclass Hamming distance is (va, ba). Va is a voiced front fricative and is visually similar to ba. Although the production of ba requires that the lips be closed before the voice release, this visual cue is absent in the motion flow parameters that are computed from the visual images and used as inputs for the visual modality. The point of Figure 4 is to suggest that the self-organized codes are not totally random and seem to be biased by factors such as interpattern relationships in the input space.

7 Conclusions and Future Directions

Supervised learning in conventional multilayered neural networks can be formulated without explicitly specifying output label functions. Learning rules were defined on the basis of between-class and within-class error criteria. The criteria could be extended to incorporate more general constraints, such as error-correcting output labels. The proposed scheme has
Figure 4: Average interclass Hamming distances with standard deviations for the various class pairs developed by the self-organizing network.
been experimentally tested on various data sets. Interesting features emerge in the output representations of such self-organizing networks: some of the "intrinsic" or "natural" distances are somewhat preserved in the output projections. As a proof of concept, it has been shown that networks can self-organize in a supervised manner and develop their own interesting output labels.

Appendix: Details of Experiments

The cardinality terms were factored into the learning rates. α is the momentum factor. The activation function used is 1/(1 + e^{−0.667x}). The sigmoid prime offset was 0.1. Weights were initialized in [−3.7, 3.7].

I. Iris Experiment
Number of classes: 3
Network architecture: 4 inputs, 4 hidden, 3 outputs
Class discernibility phase parameters: η1 = 0.00; η2 = 0.01; α = 0.00
Class uniformity phase parameters: η1 = 0.20; η2 = 0.00; α = 0.8

II. Multispeaker Syllable Experiment
Number of classes: 4
Network architecture: 192 inputs, 50 hidden, 6 outputs
Class discernibility phase parameters: η1 = 0.00; η2 = 0.09; α = 0.1
Class uniformity phase parameters: η1 = 0.08; η2 = 0.02; α = 0.6
Benchmark parameters: η = 0.1; α = 0.1 initially; α = 0.6 after 200 epochs

III. Multimodal Data Experiment
Number of classes: 5
Network architecture: 216 auditory plus 125 visual inputs, 40 hidden, 10 outputs
Class discernibility phase parameters: η1 = 0.00; η2 = 0.01; α = 0.00
Class uniformity phase parameters: η1 = 0.20; η2 = 0.00; α = 0.8
Benchmark parameters: η = 0.20; α = 0.8

Acknowledgments

I thank Dana H. Ballard for his valuable comments on this work. I am also grateful to Anil K. Jain for his important suggestions and pointers to related work. Also, thanks to Virginia deSa for providing the multimodal data, and to the anonymous reviewers for their constructive criticism and insightful comments. This material is based on work supported by the National Science Foundation under grant IRI-8903582. The government has certain rights in this material.

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugen., 7, 178–188.
Fukunaga, K. (1972). Introduction to statistical pattern recognition. Orlando, FL: Academic Press.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Kuhnel, H., & Tavan, P. (1991). A network for discriminant analysis. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks. New York: Elsevier Science.
Mao, J., & Jain, A. K. (1993). Discriminant analysis neural networks. In Proceedings of the IEEE International Conference on Neural Networks, San Francisco (Vol. 1, pp. 300–305).
Miller, R. G. (1955). Perceptual confusions among consonants. Journal of the Acoustical Society of America, 27, 338–352.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1987). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Sammon, J. R., Jr. (1969). A non-linear mapping for data structure analysis. IEEE Trans. Comput., C-18, 401–409.
Sarukkai, R. (1994). Supervised learning without output labels (Tech. Rep. No. 510). Rochester, NY: University of Rochester, Department of Computer Science.
deSa, V. (1994). Unsupervised classification learning from cross-modal environmental structure (Doctoral dissertation, URCS TR#536). Rochester, NY: University of Rochester.
Young, T. Y., & Calvert, T. W. (1974). Classification, estimation, and pattern recognition. New York: American Elsevier.
Received November 14, 1994; accepted May 6, 1996.
Communicated by Lawrence Abbott
How Well Can We Estimate the Information Carried in Neuronal Responses from Limited Samples?

David Golomb
Zlotowski Center for Neuroscience and Department of Physiology, Faculty of Health Sciences, Ben-Gurion University of the Negev, Beer-Sheva, Israel
John Hertz
Nordita, Copenhagen, Denmark

Stefano Panzeri
Department of Experimental Psychology, Oxford University, Oxford OX1 3UD, U.K.

Alessandro Treves
SISSA, Biophysics and Cognitive Neuroscience, Trieste, Italy

Barry Richmond
Laboratory of Neuropsychology, NIMH, NIH, Bethesda, MD, USA
It is difficult to extract the information carried by neuronal responses about a set of stimuli because limited data samples result in biased estimates. Recently two improved procedures have been developed to calculate information from experimental results: a binning-and-correcting procedure and a neural network procedure. We have used data produced from a model of the spatiotemporal receptive fields of parvocellular and magnocellular lateral geniculate neurons to study the performance of these methods as a function of the number of trials used. Both procedures yield accurate results for one-dimensional neuronal codes. They can also be used to produce a reasonable estimate of the extra information in a three-dimensional code, in this instance within 0.05–0.1 bit of the asymptotically calculated value (about 10% of the total transmitted information). We believe that this performance is much more accurate than that of previous procedures.

1 Introduction

Quantifying the relation between neuronal responses and the events that have elicited them is important for understanding the brain. One way to do this in sensory systems is to treat a neuron as a communication channel (Cover & Thomas 1991) and to measure the information conveyed by the
neuronal response about a set of stimuli presented to the animal. In such experiments (e.g., Gawne & Richmond 1993; McClurkin et al. 1991; Tovée et al. 1993), a set of S sensory (e.g., visual) stimuli is presented to the animal, each stimulus being presented for Ns trials. After the neuronal response is quantified in one of several ways (e.g., the number of spikes in a certain time interval or a descriptor of the temporal course of the spike train), the transmitted information (mutual information between stimuli and responses) is estimated. This approach is useful for investigating issues such as the resolution of spike timing (Heller et al. 1995), the effectiveness of encoding for stimulus sets (Optican & Richmond 1987; McClurkin et al. 1991; Rolls et al. 1996a, 1996b), and the relations between responses of different neurons (Gawne & Richmond 1993).

The equation for transmitted information can be written in several ways, including:

$$I(S; R) = \int dr \sum_s P(s, r) \log_2 \left[ \frac{P(s, r)}{P(s)\,P(r)} \right] = \left\langle \sum_s P(s \mid r) \log_2 \left[ \frac{P(s \mid r)}{P(s)} \right] \right\rangle_r. \qquad (1.1)$$

Here P(s, r) is the joint probability of stimulus s and response r, P(s | r) is the conditional probability of stimulus s given response r, and P(s) and P(r) are the unconditional stimulus and response probabilities, respectively. The stimulus index s is discrete, as there is a finite number of stimuli. The response measure r can be either discrete or continuous, depending on the way the response is quantified. The notation ⟨· · ·⟩_r in the second form indicates an average over the (unconditional) response distribution. Computing I(S; R) requires estimating these joint or conditional probabilities and carrying out the appropriate integration over response space or the average over the response distribution.

A large number of trials is needed for accurate information calculations. When the number of data samples is small, there are systematic errors in estimating the transmitted information by direct application of the formal definition, that is, by binning responses and estimating the relevant probabilities as the empirical bin frequencies (Miller 1955; Carlton 1969; Optican et al. 1991; Treves & Panzeri 1995; Abbott et al. 1996; Panzeri & Treves 1996). With a small number of data samples, the information is overestimated: it is biased upward. Intuitively, this can be understood by noting that the number of bins can exceed the number of data samples, especially as the dimensionality of the response increases. The data samples must be evenly distributed across the bins to yield zero information, a circumstance that cannot occur when the number of bins exceeds the number of data samples. On the other hand, if the number of bins is too small, responses that should be reliably distinguishable will not be, leading to information values
that underestimate the true values. These effects are magnified when the dimensionality of the response increases, and thus even more samples are needed to make accurate information estimates when the response dimensionality is larger than one. In practice, the size of data sets is limited by experimental constraints. Thus it is important to try to correct for these systematic errors and to design experiments so that a sufficient number of data samples is obtained to answer reliably the questions posed. Only then can we make effective use of mutual information measurements.

Here, we test procedures designed to overcome the systematic errors that arise with limited data sets by comparing mutual information calculations from large data sets with those calculated from smaller ones. Currently the only way to get such large data sets is through simulations. Artificial spike trains are created using a model of the response of lateral geniculate nucleus (LGN) neurons (Golomb et al. 1994). The model is based on experimentally measured spatiotemporal receptive fields of LGN neurons (Reid & Shapley 1992). Two procedures are tested here. The first is a binning-and-correcting procedure: the information is calculated by direct application of the first definition in equation 1.1, and a correction value is estimated and subtracted; this method attacks the bias problem (Panzeri & Treves 1996; Treves & Panzeri 1995). The second method uses a neural network to estimate the conditional probabilities in the second definition in equation 1.1 directly (Heller et al. 1995; Hertz et al. 1992; Kjaer et al. 1994). We find that both methods yield accurate results for one-dimensional codes, even for a relatively small number of samples. Moreover, estimates of the extra information carried in three-dimensional codes are also reasonable, within 0.05–0.1 bit (about 10%) of the correct values.

2 Methods

2.1 Producing Simulated Data. The spike trains are created using a model of the response of parvocellular and magnocellular LGN cells, as described in Golomb et al. (1994). In brief, the spatiotemporal receptive fields R(r⃗, t) of the two cell types in response to an impulse in space and time were measured (Reid & Shapley 1992; data presented in Figure 1 of Golomb et al. 1994). The set of stimuli {σ_s}, s = 1, . . . , S (= 32), consists of 4 × 4 flashed Walsh figures in space and their contrast reverses,

$$\sigma_s(\vec{r}, t) = u_s(\vec{r})\,\Theta(t), \qquad (2.1)$$
where u_s(r⃗) are the spatial Walsh figures and Θ(t) is the Heaviside function (Θ(t) = 1 if t > 0 and 0 otherwise). The ensemble-average response Z_s(t) to the sth figure is calculated by centering the figure on the receptive field center, convolving it with the spatiotemporal receptive field, adding the constant
baseline Z_0 corresponding to the spontaneous firing, and rectifying at zero response:

$$Z_s(t) = \Theta\!\left[ Z_0 + \int d\vec{r} \int_{-\infty}^{t} dt'\, R(\vec{r},\, t - t')\, \sigma_s(\vec{r},\, t') \right]. \qquad (2.2)$$
The response Z_s(t) for the Walsh figures is shown in Figure 5 of Golomb et al. (1994). Realizations of spike trains are created at random with inhomogeneous Poisson statistics, using the average response as the instantaneous rate. The probability density of obtaining a spike train Λ_s(t), with k spikes at times t_1, . . . , t_k during a measurement time T, is

$$P(\Lambda_s(t) \mid \sigma_s) = P(t_1 \ldots t_k \mid \sigma_s(\vec{r}, t)) = \frac{1}{k!} \left( \prod_{i=1}^{k} Z_s(t_i) \right) \exp\left( -\int_0^T Z_s(t')\, dt' \right). \qquad (2.3)$$
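The paper does not spell out its sampling algorithm, but one standard way to draw realizations from the inhomogeneous Poisson density of equation 2.3 is thinning (Lewis–Shedler rejection). The sketch below, with names of my own choosing, assumes the rate Z_s(t) is bounded above by rate_max on [0, T]:

```python
import numpy as np

def sample_spike_train(rate_fn, T, rate_max, rng):
    """One inhomogeneous Poisson realization on [0, T] by thinning.
    rate_fn(t) is the instantaneous rate Z_s(t), bounded above by rate_max."""
    t, spikes = 0.0, []
    while True:
        t += rng.exponential(1.0 / rate_max)      # candidate event from a homogeneous process
        if t > T:
            return np.array(spikes)
        if rng.uniform() < rate_fn(t) / rate_max:
            spikes.append(t)                      # keep with probability Z_s(t) / rate_max
```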
A set of 1024 simulated responses for each of the 32 stimuli is used for testing the information calculation procedures. The asymptotic estimate of transmitted information is calculated using 10^6 trials per stimulus.

2.2 Response Representation. The neuronal response to a stimulus, as represented by the spike train, is quantified by several variables. One is the number of spikes (NOS) in the response time interval, taken here to be 250 ms. The others are the projections of the spike train onto the first n principal components (PCs) (Richmond & Optican 1987; Golomb et al. 1994). We concentrate here on the first principal component (PC1) and the first three principal components (PC123). The four-dimensional code composed of the first three principal components and the number of spikes (PC123s) is also considered.

2.3 Information Estimation. We briefly describe here some of the methods that can be used for estimating information from neuronal responses. We simulate an experiment in which a set of S stimuli is presented at random. Each stimulus is shown Ns times; here Ns is the same for all the stimuli. The total number of visual stimuli presented is N = S Ns.

2.3.1 Summation over the Poisson Distribution. For each stimulus here, the number of spikes NOS is Poisson distributed with mean $\overline{NOS} = \int_0^T Z_s(t)\,dt$. The asymptotic value of the transmitted information carried by the NOS can be calculated directly by summing over the distribution.
Equation 1.1 becomes

$$I(S; NOS) = -\sum_{NOS} P(NOS) \log_2 P(NOS) + \frac{1}{S} \sum_{s} \sum_{NOS} P(NOS \mid s) \log_2 P(NOS \mid s). \qquad (2.4)$$
This sum is discrete and is calculated using the Poisson probability distribution P(NOS | s). The sum over NOS from 1 to ∞ is replaced by a sum from 1 to NOS_max = 36; taking a higher NOS_max has only a negligible effect on the result. Using this method, the mutual information can be calculated exactly, but only when the firing rate distribution is known. In real experiments, the firing rate distribution is unknown, and therefore the methods described below must be used.

2.3.2 Straightforward Binning (Golomb et al. 1994). The principal components used here were calculated from the covariance matrix C(t, t') formed over all responses in the set under study,

$$C(t, t') = \frac{1}{S N_s} \sum_{s=1}^{S} \sum_{\mu=1}^{N_s} \left[ \Lambda_{s,\mu}(t) - \bar{\Lambda}(t) \right] \left[ \Lambda_{s,\mu}(t') - \bar{\Lambda}(t') \right], \qquad (2.5)$$
where Λ_{s,µ}(t) is the µth realization of the response to the sth stimulus and Λ̄(t) is the average response over all the stimuli and realizations,

$$\bar{\Lambda}(t) = \frac{1}{S N_s} \sum_{s=1}^{S} \sum_{\mu=1}^{N_s} \Lambda_{s,\mu}(t). \qquad (2.6)$$
The eigenvalues of the matrix C are labeled in decreasing order; the corresponding eigenvectors are Φ_1(t), Φ_2(t), . . . . The expansion coefficients of the neuronal response Λ_{s,µ}(t) are given by

$$a_{s,\mu,m} = \frac{1}{T} \int_0^T dt\, \Lambda_{s,\mu}(t)\, \Phi_m(t). \qquad (2.7)$$
Each response is then quantified by its coefficients on the first n principal components, and these are used as the response representation. The number n of coefficients used to quantify the response is referred to here as the code dimension. The maximal and minimal values of each component are found, and the interval between the minimum and the maximum of the mth component is divided into R(m) bins. The mutual information is calculated from the discrete distribution obtained. The n-dimensional response space is therefore divided into $R = \prod_{m=1}^{n} R(m)$ n-dimensional bins.
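The resulting raw ("plug-in") estimate can be written in a few lines. This sketch, with names of my own choosing, takes precomputed bin indices and applies the first form of equation 1.1 to the empirical bin frequencies:

```python
import numpy as np

def plugin_mutual_information(stim_ids, bin_ids, S, R):
    """Raw binned estimate of I(S;R): stim_ids[t], bin_ids[t] give the
    stimulus and response bin of trial t; S stimuli, R response bins."""
    joint = np.zeros((S, R))
    np.add.at(joint, (stim_ids, bin_ids), 1.0)   # count table
    joint /= joint.sum()                         # empirical P(s, r)
    ps = joint.sum(axis=1, keepdims=True)        # P(s)
    pr = joint.sum(axis=0, keepdims=True)        # P(r)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (ps * pr)[nz]))
```

As discussed above, this raw estimate is biased upward for small Ns, which is what the correction of Section 2.3.3 addresses.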
For PC123, we choose R(1) = 36, R(2) = 20, and R(3) = 20. The mutual information carried by the first principal component only, PC1, is calculated in a similar way with R(1) = 36. Note that binning is a simple form of regularization, and some kind of regularization is always needed when response measurements span a continuous range (the number of spikes is discrete). Regularization results in a downward bias of the calculated information value (see section 1). For PC1 and our simulated data set, using R(1) = 36 results in underestimating the mutual information by ∼0.01 bit in comparison to R(1) = 300.

2.3.3 Binning with Finite Sampling Correction. The binning method is improved by doing the following:
1. For each dimension, equipopulated bins are used. For a one-dimensional code (NOS, PC1), the bin size varies across the response dimension, with nonequal spacing, so that each bin receives on average the same number of counts. For a three-dimensional code (PC123), the equipopulated binning is done for each dimension separately (Panzeri & Treves 1996).

2. The systematic error due to limited sampling can be expanded analytically in powers of 1/N. Treves and Panzeri (1995) have shown that the first-order correction term C1 carries almost all the error, provided that N is large enough. Therefore, we correct the information estimate by subtracting C1 from the result calculated by raw equipopulated binning. The term C1 is expressed as (Panzeri & Treves 1996)

$$C_1 = \frac{1}{2N \ln 2} \left\{ \sum_s \tilde{R}_s - R - (S - 1) \right\}, \qquad (2.8)$$
where R̃_s is the number of "relevant" response bins for the trials with stimulus s.

The number of relevant bins R̃_s differs from the total number of bins R allocated, because some bins may never be occupied by responses to a particular stimulus. Thus, if the term C1 is calculated using the total number of bins R for each stimulus, the systematic error is overestimated whenever there are stimuli that fail to elicit responses spanning the full response set. The number of relevant bins also differs from the number of bins actually occupied for each stimulus (with few trials), R_s, because more trials might have occupied additional bins. Consequently, C1 is underestimated if R_s is used when only a few trials are available (the underestimation becoming negligible for R/Ns ≪ 1, because R_s then tends to coincide with R̃_s for all stimuli). We estimate R̃_s (a number between R_s and R) from the data by assuming that the expectation value of the number of occupied bins should be precisely R_s, given the number of trials available and the estimate R̃_s.
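A minimal sketch of the correction follows. For simplicity it approximates the relevant bin numbers R̃_s by the occupied bin counts R_s, whereas the paper's estimate of R̃_s lies between R_s and R, so this version slightly understates C1 when trials are few. Names are my own:

```python
import numpy as np

def first_order_correction(counts):
    """C1 of eq. 2.8 from a count table counts[s, b] (stimulus s, response bin b),
    using the occupied-bin approximation R_tilde_s ~ R_s."""
    S, R = counts.shape
    N = counts.sum()
    R_s = (counts > 0).sum(axis=1)               # occupied bins per stimulus
    return (R_s.sum() - R - (S - 1)) / (2.0 * N * np.log(2))

# usage: I_corrected = I_raw - first_order_correction(counts)
```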
Table 1: Number of Bins R Used for Codes and Numbers of Trials Ns

Ns                                    16       32           64           128
R for NOS                             16       1 + NOSmax   1 + NOSmax   1 + NOSmax
R for PC1                             16       36           63           128
R = R(1) × R(2) × R(3) for PC123      4×2×2    6×3×2        7×3×3        8×4×4
The algorithm for achieving this result is explained in Panzeri and Treves (1996). This method for correcting the information values in the face of a limited number of data samples yields reasonable results even up to R/Ns ≃ 1, the region in which we use the method in the study presented here. Equation 2.8 depends on the probability distributions much more weakly than does the mutual information itself, because the dependence is only through the parameters R̃_s. Therefore, although the parameters R̃_s have to be estimated from the data, this procedure leads to better accuracy.

The choice of the number of bins R(m) in each dimension for an experiment with S stimuli and Ns trials per stimulus remains somewhat arbitrary. Here we choose R ∼ Ns, to be at the limit of the region where the correction procedure is expected to work, and thus still be able to control finite sampling while minimizing the downward bias produced by binning into too few bins. For NOS, however, each response is just an integer ranging from 0 to the maximal number of spikes (NOS_max, which is 25 for the parvocellular cell and 34 for the magnocellular cell), so even if we allocate more bins than this maximum, the extra ones will stay empty. For a multidimensional code (e.g., PC123), we allocate the number of bins R(m) in the mth direction in relation to the amount of mutual information carried by that principal component alone, as shown in Table 1.

When differences between different codes are calculated, we use the same number of bins in the relevant dimension. When PC1 and NOS are compared, we use the same number, R, as for PC1; in this case, many bins for NOS stay empty. For comparing PC123 and NOS, we use the same number of bins for NOS as for the first principal component, which is the richest in information among the three (e.g., 8 for Ns = 128). In this way we compare quantities calculated in a homogeneous way.

2.3.4 Neural Network. A two-layer network is trained by backpropagation to classify the neuron's responses according to the stimuli that elicited them (Hertz et al. 1992; Kjaer et al. 1994). The network uses sigmoidal activations for the nodes in the hidden layer and exponential activations for the nodes in the output layer, with the sum of the outputs normalized to one after each step. The input to the network is the quantified output of the biological neuron: the spike count, the first n principal components, or both. There is one output unit for each stimulus, and, after training,
the value of output unit number s is an estimate P̂(s | r) of the conditional probability P(s | r). The mutual information is then calculated from these estimates using the second form of equation 1.1; the average over the response distribution is estimated by sampling over randomly chosen data points. All other parameters of the algorithm (notably the number of training iterations and the input representation) are controlled by cross-validation. The data are divided into training and test sets, and for each representation, the training is stopped when the test error, defined as

$$E = -\sum_{\mu} \log_2 \hat{P}(s_\mu \mid r_\mu) \qquad (2.9)$$
(the negative log-likelihood, or cross-entropy), reaches a minimum. In equation 2.9, the index µ labels the trials in the test set, and s_µ is the stimulus that actually evoked the response r_µ observed in that trial. After carrying out this procedure for four train–test data splits, the optimal response representation is chosen as the one with the lowest average test error. In the present calculations, as in previous work using the responses of neurons in the monkey visual system (Heller et al. 1995), the optimal representation consists of the spike count plus the first three principal components. Including the spike count leads to smaller cross-entropy values than those obtained from the three PCs alone.

3 Results

We calculated the information carried about a set of 32 Walsh patterns by the simulated neuronal response quantified by the number of spikes, I(S; NOS), the first principal component, I(S; PC1), and the first three principal components, I(S; PC123) (see Figure 1). For the network technique, we also calculated the information carried by the first three principal components and the spike count together, I(S; PC123s) (see Figure 2). The arrows at the right side of the panels in Figure 1 represent the asymptotic values calculated from simple binning using 10^6 trials per stimulus. These figures show the estimated transmitted information for Ns = 16, 32, 64, 128, and for two sample cells: magnocellular and parvocellular.

The parvocellular cell has sustained activity over an interval of 250 ms, whereas the magnocellular cell is active mostly over the first 100 ms and has more phasic responses (Golomb et al. 1994). Thus, we expect the multidimensional codes to capture a larger proportion of the information in its responses, while a simple rate code is more likely to be an acceptable zeroth-order description of the parvocellular cell. The results we obtain (compare panels A and C, and D and F, respectively, in Figure 1) bear this expectation out.

All of the calculations show that the first principal component is more informative about the stimulus than the spike count. As expected, this effect
Figure 1: The information about a stimulus set of 32 Walsh figures conveyed by the neuronal response of model parvocellular (A–C) and magnocellular (D–F) cells, as estimated by various methods. The response is quantified by the number of spikes (A,D), the first principal component (B,E), and the first three principal components (C,F). The mutual information is estimated by straightforward equipopulated binning (dotted lines), equipopulated binning with finite sampling correction (dashed lines), and a neural network (solid lines). The numbers of bins for each code and Ns are shown in Table 1. The neural network has six hidden units and a learning rate of 0.003. The arrows at the right indicate an asymptotic value (a very good approximation of the "true" value) obtained with equispaced binning with 36 × 20 × 20 bins and 10^6 trials per stimulus.
Figure 2: The information about a stimulus set of 32 Walsh figures conveyed by the neuronal response of model parvocellular (A) and magnocellular (B) cells. The response is quantified by the first three principal components with (solid line) and without (dotted line) the number of spikes. The mutual information is estimated by a neural network with six hidden units and a learning rate of 0.003. I(S; PC123) is shown also in Figures 1C and 1F.
is especially strong for the magnocellular cell, as the first PC weighting function suppresses contributions from spikes after the first 100 ms, which are mainly noise.

Figure 1 shows that the raw binning is strongly biased upward. The difference between the estimates made with raw and corrected binning varies very little with Ns for PC1 and PC123. This is because the first-order correction term (see equation 2.8) is approximately proportional to R/Ns, which we chose to keep roughly constant in our calculation. For NOS there is no point in choosing R above 1 + NOS_max; hence the correction term, and with it the raw estimate, decreases as Ns is increased.

Both the corrected binning method and the network method tend to underestimate the information in PC1 and PC123 (the only counterexample is shown in Figure 1E, where the binning method overestimates it). This is because both methods involve a regularization of the responses (explicit in the binning, implicit in the network), and regularization always decreases the amount of information present in the raw response. For example, the corrected binning method underestimates the information whenever the number of bins is too small to capture important features of the probability distribution of the responses. The effect is therefore strong for PC123 when the number of bins in the direction of the first PC is not large enough, and the downward bias decreases with increasing Ns because the number of bins increases too. The underestimation does not occur with NOS
because the maximal number of spikes in our examples is around 30, so finer binning would be meaningless. In general, the underestimation due to regularization is more prominent for the higher-dimensional code (PC123) because the effect of adding more bins in each dimension is stronger when the number of bins is small.

The neural network tends to find a larger amount of information carried by the first three principal components together with the spike count than by the first three principal components alone (Heller et al. 1995). For the parvocellular cell, I(S; PC123s) exceeds I(S; PC123) by 0.025 bit (2.6%) for Ns = 64 and by 0.008 bit (0.8%) for Ns = 128; for the magnocellular cell, the extra information is 0.03 bit (7%) for Ns = 64 and 0.04 bit (9%) for Ns = 128; compare the solid lines in Figures 1 and 2. This increase occurs despite the high correlation between the first principal component and the number of spikes, especially for the parvocellular cell.

In Figure 3 we present the differences I(S; PC1) − I(S; NOS), between the information carried by PC1 and NOS, and I(S; PC123) − I(S; NOS), between PC123 and NOS. For all the cases considered here, the network yields a value for the extra information that is biased downward. This shows that the network automatically regularizes responses, and apparently the regularization is stronger for the higher-dimensional code. The corrected binning technique, on the other hand, gives both downward- and upward-biased values for the extra information; in this instance, downward for the parvocellular cell.

Since the choice of the number of bins along each dimension is somewhat arbitrary for a multidimensional code, we checked the effect of using different binning schemes for Ns = 128. The results are summarized in Table 2A. The information differences for the parvocellular cell span a 30% range; the information differences for the magnocellular cell are all nearly the same, no matter which scheme is used. Thus, even in the least favorable case, information differences obtained with different binning schemes remain within the range of the remaining (downward) systematic error, about 30%. In a similar way, we varied the parameters of the network: the number of hidden units and the learning rate (see Table 2B). The difference in information varies within 10% for both cells, indicating that changes in these parameters are less important than the downward bias due to the regularization. The test error for the various network parameters is quite similar, with differences within 0.4% for both cell types; thus, it is difficult to determine the best network result from this number alone.

4 Discussion

The main result of this work is that the information carried in the firing of a single neuron during a certain time interval about a set of stimuli, and quantified by a certain code, can be estimated with reasonable accuracy (within 0.05–0.1 bit, or 10%) using either of the two procedures described
Figure 3: Differences between measures used for quantifying the response of a parvocellular cell (A,B) and a magnocellular cell (C,D). A and C show the difference between the information carried by the first PC and that carried by the number of spikes; B and D show the difference between the information in the first three PCs and that in the number of spikes. Dashed lines indicate results obtained from normalized equipopulated binning; solid lines indicate those from the neural network. In A and C, the number of bins is 16 for Ns = 16 and 36 for Ns ≥ 32. In B and D, the number of bins for the first three principal components is as given in Table 1, and the number of bins for NOS is equal to R(1) in that table, that is, to the number of bins for the first PC (4, 6, 7, and 8, respectively). The arrows at the right indicate an asymptotic value obtained with equispaced binning with 36 × 20 × 20 bins and 10^6 trials per stimulus.
above. This result is achieved with the temporal response quantified by a three-dimensional code (PC123). With a one-dimensional code (NOS or PC1) the accuracy is even better. The relatively good accuracy obtained with our methods enables one to compare various codes and determine which one carries the most information (e.g., Heller et al. 1995). To understand the results in more detail, it is important to remember that information measures are specific to the stimulus set considered, but also specific to the quantities chosen to quantify neuronal responses; also,
information measures are affected by limited sampling, which typically results in an upward bias; but also, since it is always necessary to regularize continuous responses, they may be affected by the regularization, which results in a downward bias.

Table 2: Difference in Mutual Information I(S; PC123) − I(S; NOS) (Bits) for Ns = 128

A. Several Binning Schemes

R = R(1) × R(2) × R(3)    7×6×3    8×4×4    10×4×3    15×3×3
Parvocellular cell        0.145    0.135    0.125     0.114
Magnocellular cell        0.289    0.289    0.291     0.289

B. Several Network Schemes

Hidden units, learning rate    6, 0.0003    6, 0.001    10, 0.001    6, 0.003
Parvocellular cell             0.121        0.127       0.123        0.118
Magnocellular cell             0.219        0.201       0.219        0.206

Note: The standard deviations of the differences are about 0.01.

When a simple binning of the responses was used, raw information measures were strongly biased upward, and thus it was important to apply a correction for limited sampling. If one follows this procedure, the only parameter that has to be set is R, the number of response bins. If R is chosen too large, subtracting the term C1 of equation 2.8 will not be enough to correct for limited sampling (see also Treves & Panzeri 1995); if it is chosen too small, a strong regularization is imposed, and the information is underestimated. The results indicate that it is sensible to set the number of response bins at roughly the number of trials per stimulus available, R ≃ Ns. C1 is inversely proportional to the number of trials available and roughly directly proportional to the number of response bins, and this choice approximately balances the upward bias due to finite sampling against the downward bias due to the regularization.

Choosing the number of bins for each of the first three principal components in PC123 was more delicate than for NOS or PC1 and tended to yield a stronger downward bias in information values. This suggests that the binning procedure alone becomes insufficient for higher-dimensional codes, as when the spike trains of several cells are considered together.1
1 One useful procedure for the computation of information from multiple neurons is to use decoding to extract the relative probabilities of the stimuli from the responses and thus to reduce the original set of responses to the size of the stimulus set (Gochin et al. 1994; Rolls et al. 1996a, 1996b).
For all the cases considered, the network underestimated both the mutual information and the extra information in the temporal response relative to the number of spikes. This indicates that the regularization induced by the network is enough to dispose of the finite sampling bias, at the price of underestimating information values, especially for higher-dimensional codes, which are more strongly regularized.2 Information values generally increase weakly with N, which indicates that the induced regularization has decreasing effects as N becomes large. Since the underestimation is stronger for I(S; PC123) than for the one-dimensional codes, estimates of the extra information in the second and third principal components (see Figure 3) are also biased downward.

Several parameters need to be set when the network is used. Some (e.g., the number of iterations) can be set by cross-validation. Others (e.g., the learning rate) have little effect on the results across a broad range of values, as indicated in Table 2. Another virtue of the network procedure is that it effectively incorporates a decoding step, and as such it can be applied immediately to high-dimensional (e.g., multiple single-unit) data.

Although our results were obtained with simulated data, we believe that the conclusions apply to real data as well. In particular, our network results are consistent with those of Kjaer et al. (1994, Figure 10), obtained from complex V1 cells. They also found that the estimated mutual information increases with Ns and almost reaches a plateau around Ns ≈ 15 for the first principal component, but not for the first three principal components.

In this work, principal components are used to quantify the data with a low-dimensional code, because the first principal components carry most of the difference among the responses to different stimuli. However, principal component analysis is not essential to our procedures for handling finite sampling problems; any kind of n-dimensional response extracted from the neuronal firing patterns can be used (e.g., PC123s, the NOS of several neurons, or neuronal spiking times; Heller et al. 1995). Estimating the accuracy of the methods can be done only for n ≤ 3, because for higher n, calculating the asymptotic value of the transmitted information by straightforward binning requires too much computer time.

Our procedures for finite sampling corrections were demonstrated here on stimuli with a sharp onset in time. They are also applicable, however, to continuously changing stimuli (after a suitable discretization), as long as the response to each stimulus is measured during a fixed time interval T, and stimuli are either discrete or have been discretized.

Important results have been obtained by Bialek and collaborators by extracting information measures from neuronal responses (Bialek 1991; Bialek et al. 1991; de Ruyter van Steveninck & Laughlin 1996; Rieke et al. 1993). These results are based on invertebrate (insect) data, which are usually easy
2 One could even compute the corresponding C1 term (Panzeri & Treves 1996).
to collect in large quantities, so the limited sampling artifacts are less severe. Our analysis is particularly relevant to primate data, which are typically much less abundant. Note, however, that a recent paper (Strong et al. 1996) analyzes finite sampling effects following the method of binning with finite-size correction (Treves & Panzeri 1995). A second, less crucial difference between the type of experimental data considered here and that used by those investigators is that the stimulus in their case is a continuous quantity. This opens up the possibility of using notions such as the linearized limit and the gaussian approximation, which do not apply to our situation with a discrete, nonmetric set of stimuli and which, when applicable, can further alleviate finite sampling effects.

The network procedure needs a long computational time. A typical calculation for Ns = 128 and 32 stimuli runs for about 7 CPU hours on an SGI-ONYX computer. The binning-and-correcting technique is much faster, and most of its CPU time is taken up by sorting responses in order to construct equipopulated bins. For I(S; NOS), which involves no sorting, a calculation with Ns = 128 runs in less than 1.8 seconds on an HP-Apollo computer.

What is the minimum number of trials to plan for in an experiment? In our instance, results seemed reasonable when Ns = 32. However, this number depends on the stimuli, the response representation, and the neurons used. The argument sketched above indicates how to estimate the minimum number of trials in other experiments. The binning-and-correcting procedure functions reasonably up to Ns ≃ R, and the minimum R that may, if the appropriate type of response is chosen, not throw away much information,3 is R = S. Therefore a minimum of Ns = S trials per stimulus is a fair demand to be made on the design of experiments from which information estimates are to be derived.

Acknowledgments

AT and SP developed their procedure using data from the labs of Edmund Rolls, Bruce McNaughton, and Gabriele Biella, all of which was extremely useful. Partial support was from CNR and INFN (Italy).

References

Abbott, L. F., Rolls, E. T., & Tovée, M. J. (1996). Representational capacity of face coding in monkeys. Cerebral Cortex, 6, 498–505.
Bialek, W. (1991). Optimal signal processing in the nervous system. In W. Bialek (Ed.), Princeton lectures on biophysics. London: World Scientific.
3 R = S is the effective R used when applying a decoding procedure, as with multiunit data by Rolls et al. (1996b).
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Carlton, A. G. (1969). On the bias of information estimate. Psych. Bull., 71, 108–109.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
de Ruyter van Steveninck, R. R., & Laughlin, S. B. (1996). The rates of information transfer at graded-potential synapses. Nature, 379, 642–645.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G. L., & Gross, C. G. (1994). Neural ensemble encoding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.
Golomb, D., Kleinfeld, D., Reid, R. C., Shapley, R. M., & Shraiman, B. I. (1994). On temporal codes and the spatiotemporal response of neurons in the lateral geniculate nucleus. J. Neurophysiol., 72, 2990–3003.
Heller, J., Hertz, J. A., Kjaer, T. W., & Richmond, B. J. (1995). Information flow and temporal coding in primate pattern vision. J. Comp. Neurosci., 2, 175–193.
Hertz, J. A., Kjaer, T. W., Eskander, E. N., & Richmond, B. J. (1992). Measuring natural neural processing with artificial neural networks. Int. J. Neural Syst., 3 (suppl.), 91–103.
Kjaer, T. W., Hertz, J. A., & Richmond, B. J. (1994). Decoding cortical neuronal signals: Network models, information estimation and spatial tuning. J. Comp. Neurosci., 1, 109–139.
McClurkin, J. W., Optican, L. M., Richmond, B. J., & Gawne, T. J. (1991). Concurrent processing and complexity of temporally encoded neuronal messages in visual perception. Science, 253, 675–677.
Miller, G. A. (1955). Note on the bias of information estimates. Info. Theory Psychol. Prob. Methods, II-B, 95–100.
Optican, L. M., Gawne, T. J., Richmond, B. J., & Joseph, P. J. (1991). Unbiased measures of transmitted information and channel capacity from multivariate neuronal data. Biol. Cybernet., 65, 305–310.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: II. Quantification of response waveform. J. Neurophysiol., 57, 147–161.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Reid, R. C., & Shapley, R. M. (1992). Spatial structure of cone inputs to receptive fields in primate lateral geniculate nucleus. Nature, 356, 716–718.
Richmond, B. J., & Optican, L. M. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex: III. Information transmission. J. Neurophysiol., 57, 162–178.
Rieke, F., Warland, D., & Bialek, W. (1993). Coding efficiency and information rates in sensory neurons. Europhys. Lett., 22, 151–156.
Rolls, E. T., Treves, A., Tovée, M. J., & Panzeri, S. (1996a). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Submitted.
Rolls, E. T., Treves, A., & Tovée, M. J. (1996b). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research. In press.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Entropy and information in neural spike trains. Los Alamos archives cond-mat/9603127. Submitted.
Tovée, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the response of single neurons in the primate temporal visual cortex. J. Neurophysiol., 70, 640–654.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Comp., 7, 399–407.
Received February 5, 1996; accepted June 24, 1996.
Communicated by Peter Dayan
Covariance Learning of Correlated Patterns in Competitive Networks

Ali A. Minai
Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221 USA
Covariance learning is a powerful type of Hebbian learning, allowing both potentiation and depression of synaptic strength. It is used for associative memory in feedforward and recurrent neural network paradigms. This article describes a variant of covariance learning that works particularly well for correlated stimuli in feedforward networks with competitive K-of-N firing. The rule, which is nonlinear, has an intuitive mathematical interpretation, and simulations presented in this article demonstrate its utility.

1 Introduction

Covariance-based synaptic modification for associative learning (Sejnowski 1977) is motivated mainly by the need for synaptic depression as well as potentiation. Long-term depression (LTD) has been reported in biological systems (Levy & Steward 1983; Stanton & Sejnowski 1989; Fujii et al. 1991; Dudek & Bear 1992; Mulkey & Malenka 1992; Malenka 1994; Selig et al. 1995) and is computationally necessary to avoid saturation of synaptic strength in associative neural networks. In the recurrent neural networks literature, covariance learning has been proposed for the storage of biased patterns, that is, binary patterns with probability(bit i = 1) ≠ probability(bit i = 0) (Tsodyks & Feigel'man 1988), and this rule is provably optimal among linear rules for the storage of stimulus-response associations between randomly generated patterns under rather general conditions (Dayan & Willshaw 1991). However, the prescription is not as successful when patterns are correlated, that is, when probability(bit i = 1) ≠ probability(bit j = 1). The correlation causes greater similarity between patterns, making correct association or discrimination more difficult. I present a covariance-type learning rule that works much better in this situation (Green & Minai 1995, 1996).

The motivation for looking at correlated patterns arises from their ubiquity in neural network settings. For example, if the patterns for an associative memory are characters of the English alphabet on an n×n grid, some grid locations (e.g., those representing a vertical bar on the left edge) will be much more active than others (e.g., those corresponding to a vertical bar on the
right edge). The same argument applies to more complicated stimuli, such as images of human faces. This article mainly addresses another common source of correlation: the processing of (possibly initially uncorrelated) patterns by one or more prior layers of neurons with fixed weights. This situation arises, for example, whenever the output of a "hidden" layer needs to be associated with a set of responses. The results also apply qualitatively to other correlated situations.

The results presented here are limited to the case where all neural firing is competitive K-of-N. This type of firing is widely used in neural modeling to approximate "normalization" of activity in some brain regions by inhibitory interneurons (e.g., Sharp 1991; O'Reilly & McClelland 1994; Burgess et al. 1994). The relationship with the noncompetitive case (e.g., Dayan & Willshaw 1991) is discussed but is not the focus of this article.

2 Methods

The learning problem considered here was to associate m stimulus-response pattern pairs in a two-layer feedforward network. The stimulus layer was denoted S and the response layer R. All stimulus patterns had N_S bits with K_S 1's and N_S − K_S 0's. Response patterns had N_R bits with K_R 1's and N_R − K_R 0's. This K-of-N firing scheme approximates a competitive situation where broadly directed afferent inhibition keeps overall activity stable. Such competition is likely to occur in the dentate gyrus and cornu ammonis (CA) regions of the mammalian hippocampus (Miles 1990; Minai & Levy 1994), which has led to the use of K-of-N firing in several hippocampal models (Sharp 1991; O'Reilly & McClelland 1994; Burgess et al. 1994). The outputs of stimulus neuron j and response neuron i are denoted x_j and z_i, respectively, and w_ij denotes the modifiable weight from j to i. The activation of response neuron i is given by

$$y_i = \sum_{j \in S} w_{ij} x_j. \qquad (2.1)$$
All learning was done off-line; that is, the weights were calculated explicitly over a given stimulus-response data set. Performance was tested by presenting each stimulus pattern to a trained network, producing an output through K_R-of-N_R firing, and comparing it with the correct response. A figure of merit for each network on a given data set was calculated as

P = \frac{h - q}{K_R - q},   (2.2)

where h is the number of correctly fired neurons averaged over the whole data set and q ≡ K_R^2 / N_R is the expected number of correct firings for a random response with K_R active bits. Thus, P = 1 indicated perfect recall, and P = 0 represented performance equivalent to random guessing.
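To make the measure concrete, here is a minimal sketch in Python/NumPy (the function name and array layout are my own, not from the article) computing P from equation 2.2 given recalled and target response matrices:

```python
import numpy as np

def performance(recalled, target, K_R):
    """Figure of merit P (equation 2.2) for binary response matrices of
    shape (m, N_R), where each target row has exactly K_R active bits."""
    N_R = target.shape[1]
    # h: correctly fired neurons per pattern, averaged over the data set.
    h = (recalled * target).sum(axis=1).mean()
    # q: expected correct firings for a random K_R-of-N_R response.
    q = K_R ** 2 / N_R
    return (h - q) / (K_R - q)   # P = 1: perfect recall; P = 0: chance
```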
Figure 1: Histograms of individual neuron activities (i.e., fraction of patterns for which a neuron is 1) for four different levels of correlation. (a) Randomly generated patterns, no correlation. (b) 50 of 200 prestimulus neurons active per pattern (f_P = 0.25). (c) 100 of 200 prestimulus neurons active (f_P = 0.5). (d) 150 of 200 prestimulus neurons active (f_P = 0.75). The distribution of activities is close to normal for random patterns but becomes increasingly skewed with growing correlation.
Correlated stimulus patterns were generated as follows. First, an N_S × N_P weight matrix V was generated, such that each entry in it was a zero-mean, unit-variance uniform random variable. Then m independent K_P-of-N_P prestimulus patterns were generated, and each of these was then linearly transformed by V and the result thresholded by a K_S-of-N_S selection process to produce m stimulus patterns. The correlation arose because of the fixed V weights and could be controlled by changing the fraction f_P ≡ K_P/N_P, termed the correlation parameter in this article. Figure 1 shows histograms of average activity for individual stimulus neurons in cases with different degrees of correlation. Correlated response patterns were generated similarly with another set of V weights.
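A minimal sketch of this generation procedure follows (my reading of the text; in particular, the uniform range [-√3, √3] is one way to obtain zero-mean, unit-variance entries, and top-K selection implements the K-of-N thresholding):

```python
import numpy as np

def k_of_n(y, K):
    """Competitive K-of-N selection: the K largest activations fire."""
    out = np.zeros(len(y), dtype=int)
    out[np.argsort(y)[-K:]] = 1
    return out

def correlated_patterns(m, N_P, K_P, N_S, K_S, rng):
    """m correlated K_S-of-N_S stimuli from random K_P-of-N_P prestimuli
    passed through a fixed random projection V (the source of correlation)."""
    # Zero-mean, unit-variance uniform entries lie in [-sqrt(3), sqrt(3)].
    V = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(N_S, N_P))
    stimuli = np.zeros((m, N_S), dtype=int)
    for p in range(m):
        pre = np.zeros(N_P)
        pre[rng.choice(N_P, size=K_P, replace=False)] = 1.0
        stimuli[p] = k_of_n(V @ pre, K_S)
    return stimuli

# Example with f_P = K_P / N_P = 0.25, as in Figure 1b:
S = correlated_patterns(m=200, N_P=200, K_P=50, N_S=200, K_S=10,
                        rng=np.random.default_rng(0))
```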
3 Mathematical Motivation

The learning rule proposed in this article is motivated directly by three existing rules: the Hebb rule (Hebb 1949), the presynaptic rule, and the covariance rule. This motivation is developed systematically in this section. Several other related rules that are also part of this study are briefly discussed at the end of the section.

For binary patterns, insight into various learning rules can be obtained through a probabilistic interpretation of weights. Consider the normalized Hebb rule, defined by

w_{ij} = \langle x_j z_i \rangle,   (3.1)
where \langle \cdot \rangle denotes an average over all m patterns. The weights in this case can be written as

w_{ij} = P(x_j = 1, z_i = 1),   (3.2)
that is, the joint probability of pre- and postsynaptic firing. For correlated patterns, this is not a particularly good prescription. A significantly better learning rule is the presynaptic rule:

w_{ij} = \frac{\langle x_j z_i \rangle}{\langle x_j \rangle} = P(z_i = 1 \mid x_j = 1).   (3.3)
This rule is the off-line equivalent of the on-line incremental prescription

w_{ij}(t + 1) = w_{ij}(t) + \epsilon x_j(t) (z_i(t) - w_{ij}(t)),   (3.4)
hence the designation "presynaptic." This incremental rule has been investigated by several researchers (Minsky & Papert 1969; Grossberg 1976; Minai & Levy 1993; Willshaw et al. forthcoming). With this rule, the dendritic sum, y_i^\mu, of output neuron i when the \mu-th stimulus is presented to the network is given by

y_i^\mu = \sum_j P(z_i = 1 \mid x_j = 1) x_j^\mu,   (3.5)
which sums the conditional firing probabilities for i over all input bits active in the \mu-th stimulus. This rule works well in noncompetitive situations, where each output neuron sets its own firing threshold, and in competitive situations where all output units have identical firing rates. However, in a competitive situation with correlated patterns, output neurons with high firing rates acquire large weights and lock out other output units, precluding proper recall of many patterns. A rule that addresses the problem of differences in output unit firing rates is the covariance rule,

w_{ij} = \langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle = P(x_j = 1, z_i = 1) - P(x_j = 1) P(z_i = 1) = P(x_j = 1) [P(z_i = 1 \mid x_j = 1) - P(z_i = 1)],   (3.6)
which accumulates the conditional increment in probability of firing over active stimulus bits:

y_i^\mu = \sum_j P(x_j = 1) [P(z_i = 1 \mid x_j = 1) - P(z_i = 1)] x_j^\mu.   (3.7)
However, it also scales the incremental probabilities by P(x_j = 1), which means that highly active stimulus bits contribute more to the dendritic sum than less active ones, even if both bits imply the same increment in firing probability. This is not a problem when all stimulus units have the same firing rate, but it significantly reduces performance for correlated stimuli.

The analysis of the presynaptic and covariance rules immediately suggests a hybrid that combines their strengths and reduces their weaknesses. This is the proposed presynaptic covariance rule:

w_{ij} = \frac{\langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle}{\langle x_j \rangle} = \frac{\langle x_j z_i \rangle}{\langle x_j \rangle} - \langle z_i \rangle = P(z_i = 1 \mid x_j = 1) - P(z_i = 1).   (3.8)
Now the dendritic sum simply accumulates incremental firing probabilities over all active inputs,

y_i^\mu = \sum_j [P(z_i = 1 \mid x_j = 1) - P(z_i = 1)] x_j^\mu,   (3.9)
and the K_R output units with the highest accumulated increments fire. It can be shown easily that in the K-of-N situation, the mean dendritic sums for the presynaptic, covariance, and presynaptic covariance rules are

Presynaptic: \langle y_i \rangle = K_S \langle z_i \rangle
Covariance: \langle y_i \rangle = \sum_j \langle x_j \rangle \mathrm{Cov}_{ij}
Presynaptic covariance: \langle y_i \rangle = 0,   (3.10)

where K_S is the number of active bits in each stimulus and Cov_ij is the sample covariance of x_j and z_i. Clearly the presynaptic covariance rule "centers" the dendritic sum to produce fair competition in a correlated situation. Note, however, that there are limits to how much correlation any rule can tolerate; this issue is addressed in section 5.
3.1 Other Rules. It is also instructive to compare learning rules other than those mentioned above with the presynaptic covariance rule. These are:

1. Tsodyks-Feigel'man (TF) rule:

w_{ij} = \langle (x_j - p)(z_i - r) \rangle,   (3.11)

where p = N_S^{-1} \sum_{j \in S} \langle x_j \rangle and r = N_R^{-1} \sum_{i \in R} \langle z_i \rangle.
2. Postsynaptic covariance rule:

w_{ij} = \frac{\langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle}{\langle z_i \rangle}.   (3.12)
3. Willshaw rule:

w_{ij} = H(\langle x_j z_i \rangle),   (3.13)
where H(y) is the Heaviside function.

4. Correlation coefficient rule:

w_{ij} = \frac{\langle x_j z_i \rangle - \langle x_j \rangle \langle z_i \rangle}{\sigma_j \sigma_i},   (3.14)
where σ_j and σ_i are the sample standard deviations of x_j and z_i, respectively.

The TF rule differs from the covariance rule in its use of global firing rates p and r rather than separate ones for each neuron. The two rules are identical for uncorrelated patterns. Dayan and Willshaw (1991) have shown that the TF rule is optimal among all linear rules for noncompetitive networks and uncorrelated patterns. The postsynaptic covariance rule was included in the list mainly because of its superficial symmetry with the presynaptic covariance rule. In fact, the postsynaptic covariance rule sets weights as

w_{ij} = \frac{\langle x_j z_i \rangle}{\langle z_i \rangle} - \langle x_j \rangle = P(x_j = 1 \mid z_i = 1) - P(x_j = 1) = \langle x_j \mid z_i = 1 \rangle - \langle x_j \rangle.   (3.15)
Thus, the weights to neuron i come to encode the centroid of all stimulus patterns that fire i, offset by the average vector of all stimuli. Essentially, this rule corresponds to a biased version of Kohonen-style learning. It is likely to be useful mainly in clustering applications. The correlation coefficient rule is included because it represents a properly normalized version of the standard covariance rule. The weights are, by definition, correlation coefficients, so the firing of neuron i is based on how positively it correlates with the given stimulus pattern. Finally, the inclusion of the Willshaw rule (Willshaw, Buneman, & Longuet-Higgins 1969) reflects its wide use, elegant simplicity, and extremely good performance for sparse patterns.
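For completeness, these four additional rules admit equally short sketches under the same conventions as the earlier snippet (again my own illustration; H(0) is taken to be 0 in the Willshaw rule):

```python
import numpy as np

def learn_other(X, Z, rule, eps=1e-12):
    mx, mz = X.mean(axis=0), Z.mean(axis=0)
    mxz = Z.T @ X / len(X)
    if rule == "tf":                          # equation 3.11, global rates p and r
        p, r = mx.mean(), mz.mean()
        return (Z - r).T @ (X - p) / len(X)
    if rule == "postsynaptic_covariance":     # equation 3.12
        return (mxz - np.outer(mz, mx)) / (mz[:, None] + eps)
    if rule == "willshaw":                    # equation 3.13, with H(0) = 0
        return (mxz > 0).astype(float)
    if rule == "correlation_coefficient":     # equation 3.14
        sx, sz = X.std(axis=0), Z.std(axis=0)
        return (mxz - np.outer(mz, mx)) / (np.outer(sz, sx) + eps)
    raise ValueError(rule)
```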
4 Simulation Results

Three aspects were evaluated by the simulations reported here: pattern density, the effect of correlation, and capacity. Unless otherwise indicated, both stimulus and response layers had 200 neurons. N_P was also fixed at 200. All data points were averaged over ten independent runs, each with a new data set. In each run, all rules were applied to the same data set in order to get a fair comparison.

4.1 Pattern Density. In this set of simulations, the number of active stimulus bits, K_S, and active response bits, K_R, for each pattern were varied systematically, and the performance of all rules was evaluated in each situation. In each run, the stimuli and the responses were correlated through separate, randomly generated weights, as described above. The number of associations was fixed at 200. Both stimulus and response patterns were correlated with f_P = 0.25. The results are shown in Table 1. It is clear that the presynaptic covariance rule performs better than any other rule over the whole range of activities. At low stimulus activity, the presynaptic covariance rule is significantly better than the standard covariance rule over the entire range of response activities. As stimulus activity increases, the performance of the presynaptic covariance rule falls to that of the standard rule. This is because the effect of division by \langle x_j \rangle is attenuated as all average stimulus activities approach 1, making the weights under the presynaptic covariance rule almost uniformly scaled versions of those produced by the standard rule. This suggests that the best situation for using the presynaptic covariance rule is with sparse stimuli. As expected, results with all other rules were disappointing.

Table 1: Performance of Each Rule at Different Levels of Stimulus and Response Coding Densities.

Rule                    | 10/10          | 10/100         | 10/190         | 100/10         | 100/100        | 100/190        | 190/10         | 190/100         | 190/190
Presynaptic covariance  | 0.8211 ±0.0129 | 0.6012 ±0.0131 | 0.8236 ±0.0068 | 0.8032 ±0.0056 | 0.5202 ±0.0091 | 0.7969 ±0.0092 | 0.6518 ±0.0363 | 0.4182 ±0.0180  | 0.6623 ±0.0272
Covariance              | 0.6676 ±0.0295 | 0.4337 ±0.0207 | 0.6700 ±0.0192 | 0.7936 ±0.0086 | 0.5117 ±0.0133 | 0.7861 ±0.0115 | 0.6776 ±0.0302 | 0.4358 ±0.0187  | 0.6805 ±0.0232
Presynaptic             | 0.4213 ±0.0156 | 0.4717 ±0.0284 | 0.4320 ±0.0219 | 0.2116 ±0.0172 | 0.3434 ±0.0157 | 0.2163 ±0.0321 | 0.2082 ±0.0120 | 0.3485 ±0.0158  | 0.2035 ±0.0188
Hebb                    | 0.2854 ±0.0326 | 0.4006 ±0.0123 | 0.2768 ±0.0179 | 0.2051 ±0.0187 | 0.3480 ±0.0183 | 0.2117 ±0.0330 | 0.2087 ±0.0120 | 0.3485 ±0.0159  | 0.2131 ±0.0243
Tsodyks-Feigel'man      | 0.3675 ±0.0535 | 0.4467 ±0.0236 | 0.3439 ±0.0252 | 0.2723 ±0.0156 | 0.3842 ±0.0188 | 0.2719 ±0.0215 | 0.3596 ±0.0476 | 0.4558 ±0.0090  | 0.3673 ±0.0270
Postsynaptic covariance | 0.2157 ±0.0232 | 0.4340 ±0.0214 | 0.6698 ±0.0203 | 0.2860 ±0.0404 | 0.5112 ±0.0138 | 0.7926 ±0.0083 | 0.2198 ±0.0206 | 0.4362 ±0.0184  | 0.6854 ±0.0247
Willshaw                | 0.6761 ±0.0264 | 0.2856 ±0.0243 | 0.1247 ±0.0231 | 0.1187 ±0.0102 | 0.0082 ±0.0296 | 0.0066 ±0.0164 | 0.0227 ±0.0211 | −0.0014 ±0.0220 | 0.0074 ±0.0270
Correlation coefficient | 0.6306 ±0.0263 | 0.5276 ±0.0163 | 0.6452 ±0.0182 | 0.6654 ±0.0165 | 0.5075 ±0.0136 | 0.6570 ±0.0338 | 0.3190 ±0.0378 | 0.2567 ±0.0412  | 0.3600 ±0.0387

Note: The network has N_S = N_R = 200, and 200 stored associations in each run. Each rule was evaluated in nine situations: stimulus activities of 10, 100, and 190 out of 200, each at response activities of 10, 100, and 190 out of 200. These are indicated in the top row of the table by stimulus/response activity. The ± entries indicate the sample standard deviation in each case. N_P was set at 200 and K_P at 50 for both stimuli and responses. Each data point is averaged over ten independent runs.

4.2 Effect of Correlation. The correlated stimuli and responses used for simulations were generated from random patterns with activity f_P = K_P/N_P via fixed random weights. The higher the fraction f_P, the more correlated the patterns (see Figure 1). To evaluate the impact of different levels of correlation on learning with the presynaptic, covariance, and presynaptic covariance rules, three values of f_P (0.25, 0.5, and 0.75) were investigated. In all cases, N_P was fixed at 200, and network stimuli and responses were 10/200, quite sparse. The results are shown in Figure 2.

Figure 2: The performance of the learning rules at different correlation levels for stimulus and response correlations. In each graph, the × symbol indicates the performance of the presynaptic covariance rule, the solid line the performance of the standard covariance rule, and the dotted line that of the presynaptic rule. (a) The case of correlated stimuli and uncorrelated responses. (b) The results for correlated responses and uncorrelated stimuli. In all cases, stimulus and response activities were set at 10-of-200. N_P was set at 200 and K_P at 50. Each data point was averaged over ten independent runs. The network had N_S = N_R = 200, and 200 associations were stored during each run.

Figure 2a shows the effect of stimulus correlation with uncorrelated response patterns. Clearly all three rules lose some performance, as would be expected from the fact that stimulus correlation corresponds to a loss of information about the outputs. However, the presynaptic covariance rule is significantly superior to the covariance rule. The presynaptic rule, which should theoretically have the same performance as the presynaptic covariance rule in this case, is affected by the residual response correlation introduced by finite sample size. Figure 2b shows the impact of response correlation. As expected from theoretical analysis, the covariance and presynaptic covariance rules have almost identical performance, while the presynaptic rule is significantly worse.
An interesting feature, however, is the U-shape of the performance curve for the presynaptic rule, which reflects the rule's inherent propensity toward clustering. This is discussed in greater detail below.

4.3 Capacity. To evaluate the capacity of the various learning rules, increasingly large sets of associations were stored in networks of sizes 100, 200, and 400. In each case, the size referred to the number of stimulus or response neurons, which was the same. The correlation parameter, f_P, was fixed at 0.25 and stimulus and response activity at 5% active bits per pattern. The results are shown in Figures 3 and 4 for correlated stimuli and responses. Again, the presynaptic covariance rule did better than all others at all sizes and loadings. Figure 3d shows how capacity scales with network size for the presynaptic covariance rule. Each line represents the scaling for a specific performance threshold. Clearly, the scaling is linear in all cases, though it is not clear whether this holds for all levels of correlation. Figure 4 shows how the slope of the size-capacity plot changes with performance threshold. The results for the case with correlated stimulus only are also plotted for comparison. The theoretical analysis of these results will be presented in future papers.
Figure 3: The performance of the learning rules for data sets and networks of different sizes. In graphs a–c, the dashed line shows the performance of the Willshaw rule, the dot-dashed line that of the TF rule, and the dotted line the performance of the normalized Hebb rule. The performance of the presynaptic covariance rule is indicated by × and that of the standard covariance rule by ◦. Graph d shows the empirically determined capacity of the presynaptic covariance rule as a function of network size for performance levels of 0.95 (×), 0.9 (◦), 0.8 (∗), and 0.75 (+). In all cases, stimulus and response activities were set at 10-of-200. N_P was set at 200 and K_P at 50 for both stimuli and responses. Each data point was averaged over ten independent runs.
5 Discussion

It is instructive to consider the semantics of the presynaptic covariance rule vis-à-vis those of the standard covariance rule in more detail. The weights produced by the presynaptic covariance rule may be written as

w_{ij} = P(z_i = 1 \mid x_j = 1) - P(z_i = 1) = \langle z_i \mid x_j = 1 \rangle - \langle z_i \rangle.   (5.1)
Thus, a positive weight means that x_j = 1 is associated with higher-than-average firing of response neuron i, and a negative weight indicates the reverse. In this way, effects due to disparate activities of stimulus neurons are mitigated, and the firing of response neurons depends only on the average incremental excitatory or inhibitory significance of individual stimulus neurons, not just on covariance.
Figure 4: The graph shows the slope of the capacity-size line for the presynaptic covariance rule (cf. Figure 3d) as a function of the performance level used to determine capacity. The curve with × markers corresponds to the case reported in Figure 3, and the curve with ◦ markers is for correlated stimuli and uncorrelated responses. The latter is included for comparison purposes. All parameters are as in Figure 3.
In a sense, "cautious" stimulus neurons (those that fire seldom, but when they do fire, reliably indicate response firing) are preferred over "eager" stimulus neurons (those that fire more often but often without concomitant response firing). This is very valuable in a competitive firing situation. To take a somewhat extreme illustrative example, consider a response neuron i with \langle z_i \rangle = 0.5, and two stimulus bits x_1 and x_2 with \langle x_1 \rangle = 0.1, \langle x_1 z_i \rangle = 0.1, and \langle x_2 \rangle = 0.9, \langle x_2 z_i \rangle = 0.5. Thus, x_1 = 1 ⇒ z_i = 1 and z_i = 1 ⇒ x_2 = 1 (or, alternatively, x_2 = 0 ⇒ z_i = 0). However, x_2 = 1 does not imply z_i = 1. Under the covariance rule, w_{i1} = w_{i2} = 0.05, even though x_1 = 1 is a reliable indicator of z_i = 1 while x_2 = 1 is not. The presynaptic covariance rule sets w_{i1} = 0.5 and w_{i2} = 0.0556, reflecting the increase in the probability of z_i = 1 based on each stimulus bit (x_1 = 1 raises the probability of z_i = 1 to 1.0, while x_2 = 1 raises it only to 0.5556). In this way, each stimulus bit is weighted by its reliability in predicting firing.
The information x = 1 ⇒ z = 1 is extremely valuable even if x = 0 does not imply z = 0, while x = 0 ⇒ z = 0 without x = 1 ⇒ z = 1 is of only marginal value. Simple covariance treats these two situations symmetrically, whereas in fact they are quite different. The presynaptic covariance rule rectifies this. However, a price is paid in that correct response firing becomes dependent on just a few cautious stimulus bits, creating greater potential sensitivity to noise. It should be noted that the asymmetry of significance between the two situations just described arises because the usual linear summation of activation treats x = 1 and x = 0 asymmetrically, discarding the "don't fire" information in x = 0.

The comparison with the presynaptic rule is also interesting. As shown in Figure 2b, the presynaptic rule has a U-shaped dependence on output correlation, while both covariance rules have a monotonically decreasing one. This reflects the basic difference in the semantics of the two rule groups. In the presence of output correlation and competitive firing, the presynaptic rule, like the Hebb rule, has a tendency to cluster stimulus patterns, that is, to fire the same output units for several stimuli. These "winner" outputs are the neurons with a high average activity. This is a major liability if there are other output neurons with moderately lower average activity, since their firing is depressed disproportionately. However, as output correlation grows to a point where extremely disproportionate firing is the actual situation, the clustering effect of both the presynaptic and the Hebb rules ceases to be a liability, and they recover performance. In the extreme case where some response neurons fire for all stimuli and the rest for none, the presynaptic and Hebb rules are perfect. In contrast, the covariance rules fail completely in this situation because all covariances are 0. The main point, however, is that for all but the most extreme response correlation, the presynaptic covariance rule is far better than the presynaptic or the Hebb rules. The presynaptic and the presynaptic covariance rules are identical in the noncompetitive situation with adaptive firing thresholds.

5.1 Competitive versus Noncompetitive Firing. The results of this article apply only to the competitive firing situation. However, it is interesting to consider briefly the commonly encountered noncompetitive situation, where each response neuron fires independently based on whether its dendritic sum exceeds a threshold. By assuming that thresholds are adaptive, an elegant signal-to-noise ratio (SNR) analysis can be done to evaluate the performance of different learning rules for threshold neurons (Dayan & Willshaw 1991). This SNR analysis is based on the dendritic sum distributions for the cases where the neuron fires (the "high" distribution) and those where it does not (the "low" distribution). The optimal firing threshold for the neuron can then be found easily.

The key facilitating assumption in SNR analysis is that each response neuron fires independently, but this is not so in the competitive case. Indeed, for competitive firing, there is no fixed threshold for any response neuron;
an effective threshold is determined implicitly for each stimulus pattern depending on the dendritic sums of all other response neurons in the competitive group. Thus, the distribution of a single neuron's dendritic sum over all stimulus patterns says little about the neuron's performance. It is, rather, the distribution of the dendritic sums of all neurons over each stimulus pattern that is relevant. The "high" distribution for stimulus k covers the dendritic sums of response neurons that should fire in response to stimulus k, and the "low" distribution covers the dendritic sums of those response neurons that should not fire. The effective firing threshold for stimulus pattern k is then determined by the dendritic sum of the K_R-th most excited neuron. In a sense, there is a duality between the competitive and noncompetitive situations. In the former, the goal is to ensure that the correct neurons win for each pattern; in the latter, the concern is that the correct patterns fire a given neuron.

Figure 5: Errors of omission and commission by the presynaptic covariance and the standard covariance rule in a single run with noncompetitive firing. The threshold for each neuron was set at the empirically determined intersection point of its "low" and "high" dendritic sum distributions. Graphs a and b show the histograms for errors of commission (a) and omission (b) for the presynaptic covariance rule, and graphs c and d show the same for the standard covariance rule. The networks have N_S = N_R = 400, and K_S = K_R = 50. The correlation parameter is 0.25 for both stimuli and responses, and 200 associations are stored in each case.

Figure 5 illustrates the results of comparing the errors committed by the covariance and presynaptic covariance rules on the same learning problem
in the noncompetitive case. The network has 400 neurons in both layers and stores 200 associations. Activity was set at 50 (12.5%) to ensure a reasonable sample size, and f_P was set to 0.25. The threshold was set at the intersection point of the empirically determined "high" and "low" dendritic sum distributions. This is optimal for gaussian distributions. Errors of omission (0 instead of 1) and commission (1 instead of 0) are shown separately. Although Figure 5 is based on only a single run, it does suggest that the presynaptic covariance rule improves on the covariance rule in the noncompetitive case as well.
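For reference, the per-neuron threshold used for Figure 5 can be found empirically; the following sketch (the binning choices are assumptions of this illustration) locates the crossing of the "low" and "high" dendritic sum histograms:

```python
import numpy as np

def crossing_threshold(y_low, y_high, bins=50):
    """Empirical intersection of the 'low' and 'high' dendritic-sum
    distributions of one neuron, estimated from shared-bin histograms."""
    lo = min(y_low.min(), y_high.min())
    hi = max(y_low.max(), y_high.max())
    edges = np.linspace(lo, hi, bins + 1)
    h_low, _ = np.histogram(y_low, bins=edges, density=True)
    h_high, _ = np.histogram(y_high, bins=edges, density=True)
    # First bin, scanning upward, where the 'high' density overtakes 'low'.
    k = int(np.argmax(h_high > h_low))
    return 0.5 * (edges[k] + edges[k + 1])
```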
6 Conclusion

The results presented here demonstrate that for correlated patterns in competitive networks, the presynaptic covariance rule is clearly superior to the presynaptic rule and the standard covariance rule when the stimulus patterns are relatively sparsely coded (average activity < 0.5). This is true with respect to both capacity and performance. These conclusions need to be formally verified through mathematical analysis and more simulations. The results of this work will be reported in future papers. For uncorrelated stimulus patterns, simulations (not shown) indicate that the standard covariance rule has virtually the same performance as the presynaptic covariance rule.
References

Burgess, N., Recce, M., & O'Keefe, J. (1994). A model of hippocampal function. Neural Networks, 7, 1065–1081.
Dayan, P., & Willshaw, D. J. (1991). Optimising synaptic learning rules in linear associative memories. Biol. Cybernetics, 65, 253–265.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367.
Fujii, S., Saito, K., Miyakawa, H., Ito, K.-I., & Kato, H. (1991). Reversal of long-term potentiation (depotentiation) induced by tetanus stimulation of the input to CA1 neurons of guinea pig hippocampal slices. Brain Res., 555, 112–122.
Green, T. S., & Minai, A. A. (1995). Place learning with a stable covariance-based learning rule. Proc. WCNN, 1, 573–576.
Green, T. S., & Minai, A. A. (1996). A covariance-based learning rule for the hippocampus. In J. M. Bower (Ed.), Computational Neuroscience (pp. 21–26). San Diego: Academic Press.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol. Cybernetics, 23, 121–134.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791–797.
Malenka, R. C. (1994). Synaptic plasticity in the hippocampus: LTP and LTD. Cell, 78, 535–538.
Miles, R. (1990). Variation in strength of inhibitory synapses in the CA3 region of guinea-pig hippocampus (in vitro). J. Physiol., 431, 659–676.
Minai, A. A., & Levy, W. B. (1993). Sequence learning in a single trial. Proc. WCNN, 2, 505–508.
Minai, A. A., & Levy, W. B. (1994). Setting the activity level in sparse random networks. Neural Computation, 6, 85–99.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Mulkey, R. M., & Malenka, R. C. (1992). Mechanisms underlying induction of homosynaptic long-term depression in area CA1 of the hippocampus. Neuron, 9, 967–975.
O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a tradeoff. Hippocampus, 4, 661–682.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Selig, D. K., Hjelmstand, G. O., Herron, C., Nicoll, R. A., & Malenka, R. C. (1995). Independent mechanisms for long-term depression of AMPA and NMDA responses. Neuron, 15, 417–426.
Sharp, P. E. (1991). Computer simulation of hippocampal place cells. Psychobiology, 19, 103–115.
Stanton, P. K., & Sejnowski, T. J. (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339, 215–218.
Tsodyks, M. V., & Feigel'man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 6, 101–105.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Willshaw, D., Hallam, J., Gingell, S., & Lau, S. L. (Forthcoming). Marr's theory of the neocortex as a self-organizing neural network. Neural Computation.
Received January 4, 1996; accepted June 28, 1996.
Communicated by Christopher Bishop
A Mobile Robot That Learns Its Place

Sageev Oore
Geoffrey E. Hinton
Department of Computer Science, University of Toronto, Toronto, Canada M5S 1A4
Gregory Dudek
School of Computer Science, McGill University, Montreal, Canada H3A 2A7
We show how a neural network can be used to allow a mobile robot to derive an accurate estimate of its location from noisy sonar sensors and noisy motion information. The robot's model of its location is in the form of a probability distribution across a grid of possible locations. This distribution is updated using both the motion information and the predictions of a neural network that maps locations into likelihood distributions across possible sonar readings. By predicting sonar readings from locations, rather than vice versa, the robot can handle the very nongaussian noise in the sonar sensors. By using the constraint provided by the noisy motion information, the robot can use previous readings to improve its estimate of its current location. By treating the resulting estimates as if they were correct, the robot can learn the relationship between location and sonar readings without requiring an external supervision signal that specifies the actual location of the robot. It can learn to locate itself in a new environment with almost no supervision, and it can maintain its location ability even when the environment is nonstationary.

1 Introduction

We describe a method that allows a robot equipped with a ring of sonar range finders to learn to compute its location within an initially unfamiliar environment. The obvious way to solve this problem is to build an explicit map of the environment. An interesting alternative, however, is for the knowledge of the environment to be implicit in the weights of a neural network that relates sonar readings to locations.

The robot, of small trash-can proportions, has internal sensors that detect the movement of the wheels. These motion sensors may not be entirely accurate (they are blind to skidding) and therefore are not sufficient for accurate navigation over long periods of time. Each of the 12 uniformly spaced Polaroid sonar transducers crowning the robot acts as both transmitter and receiver, measuring the return time of the first echo from an emitted chirp.
The time is converted into a distance r, and the vector r of all 12 readings obtained at a point is called a sonar signature. The sonar sensings, too, are affected by various forms of noise, both reproducible and irreproducible. For example, the sound emitted by each sensor spreads out in a cone, and rays within the cone may make different numbers of bounces before returning. This can lead to phantom walls and phantom holes in corners (Dudek, Jenkin, Milios, & Wilkes, 1993; Wilkes, Dudek, Jenkin, & Milios, 1990), depending on the configuration and reflective properties of objects in the room. In addition, faulty sensors, cross-talk between transducers, energy dissipation of the sonar signal, unreturned echoes, and very nearby objects can lead to irreproducible random effects that have no apparent structure. Moreover, even the accurate sensings can be assumed to have approximately gaussian additive noise. In this work, we have used simulator-based data, as will be described in section 3.1, and have assumed that the robot's orientation is known.

2 Learning a Model That Relates Locations to Sonar Readings

2.1 Predicting the Location from the Sonar Readings. There is an obvious way of using a feedforward neural network to derive a location from a set of sonar readings. The inputs to the net are the sonar readings, and the outputs are the location, which may be represented as two real-valued coordinates or may be a probability distribution over a grid of possible locations. Although this type of net works moderately well (Dudek & Hinton 1992), it has some major drawbacks. During the initial training, the network requires supervision in the form of accurate information about the true location of the robot, and further supervision is required whenever the environment changes. The network is also poor at handling situations in which a sonar sensor is unreliable or a small change in location causes the sensor to return the maximum value rather than the distance to a small, close surface. When this happens, the input to the network changes a lot, but the outputs should change only slightly. Such slight changes are hard to achieve in a standard feedforward network because the outputs must be sensitive to at least some of the inputs.

2.2 Integrating Sonar Information over Time. Instead of using a neural network to predict locations from sonar readings, we could predict in the opposite direction and then use Bayes' theorem. For each sonar signature, r, the same feedforward neural network is applied once at each possible grid cell, g_i, to estimate the likelihood, P(r | g_i), of obtaining that signature from that cell. Instead of integrating over all possible locations within each cell, the neural net takes as input the location, x_i, of the center of the cell and produces as output a conditional probability distribution under which the density of r is computed.
plied by the robot’s prior probability of being at the cell, and the products are renormalized1 to obtain a posterior probability distribution that incorporates the information in the current sonar readings. In the case that we know the environment is stationary (a common assumption), we can improve the efficiency of the tracking algorithm by a large factor, once the neural network has been trained, by keeping the density distribution functions P(r | gi ) in a lookup table, and thus eliminating the need for any more forward passes through the network: Ptposterior (gi )
P(rt | gi )Ptprior (gi ) . = P t t j P(r | gj )Pprior (gj )
(2.1)
To make the updating straightforward, we need to assume that the prior probabilities of cells are independent of the likelihoods obtained for the current sonar readings. This independence assumption is dubious because the prior probability distribution was derived using the same neural network and the same sensors at earlier nearby locations. To see just how bad the independence assumption can be, imagine what would happen if the robot stayed at exactly the same location and kept repeating its sonar readings. It would eventually be certain that it was in the cell that had the highest likelihood of producing the observed distribution of sonar signatures, even if many other cells had a likelihood almost as high. It is well known that maximum likelihood estimation will yield a biased prediction of the variance of a distribution, which could further increase the chances of confidence in an incorrect position estimate. To check for this effect, we tried using MacKay's (1991) unbiased variance prediction method and found that the results did not differ significantly.

The prior probability distribution is obtained by taking the previous posterior distribution and updating it using the noisy movement information.[2] To eliminate the increase in entropy of the distribution caused by quantization effects when the movement in each direction is not an integral multiple of the grid spacing, we separate the expected movement in each dimension into an integral and a fractional part. The fractional part, f^{t-1,t}, is accumulated by moving the location of every grid cell:

x_i^t = x_i^{t-1} + f^{t-1,t}.   (2.2)
[1] By choosing a fine enough grid, we can assume that the density of the position distribution is locally linear within a cell, and therefore the probability mass contained in the cell is well approximated by using the probability density of a reading being obtained at the center of a cell.

[2] This approach resembles Kalman filtering, but the probability distribution over the underlying state is nongaussian, and the model that relates the underlying state to the observations is nonlinear.
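Pulling the pieces together, one update cycle of this grid-based filter might look as follows. This is a sketch under simplifying assumptions (`np.roll` wraps at the borders instead of accumulating mass in edge cells, and the blur anticipates equation 2.3 below, with `sigma_cells` being the motion noise expressed in cell widths); `likelihood` stands in for the per-cell outputs of the sonar network:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_step(P, offset, move_cm, cell_cm, sigma_cells, likelihood):
    """One cycle: motion shift (eq. 2.2), gaussian blur (eq. 2.3),
    then multiply by sonar likelihoods and renormalize (eq. 2.1)."""
    total = np.asarray(offset, dtype=float) + np.asarray(move_cm, dtype=float)
    cells = np.floor(total / cell_cm).astype(int)   # whole-cell part of the move
    offset = total - cells * cell_cm                # fractional part, 0 <= offset < cell
    P = np.roll(P, shift=tuple(cells), axis=(0, 1))
    P = gaussian_filter(P, sigma=sigma_cells)       # blur by the motion noise
    P = P * likelihood                              # current sonar evidence
    return P / P.sum(), offset
```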
We can imagine this procedure as manipulating a single vector offset associated with the entire grid. For example, suppose each grid cell is 10 cm wide, and a motion of 23 cm to the east is perceived. Then the probabilities in each cell are moved over by two cells, and furthermore the offset of the entire grid is increased by 3 cm (i.e., the centers of the individual cells are themselves displaced). If the offset was already 9 cm, then we would instead move the probabilities over by three cells and subtract 7 cm from the offset, to keep the offset between zero and one cell width. After moving the probabilities over and adjusting the grid offset, we blur the distribution by the assumed noise, σ^{t-1,t}, in the movement estimate:

P^t_{prior}(g_i) = \sum_j w_{ji} P^{t-1}_{posterior}(g_j),   (2.3)
where the w_ji are the terms in a discrete approximation to a gaussian convolution kernel. In our simulation, the actual motion of the robot is a random sample from a gaussian with mean µ^{t-1,t} and standard deviation σ^{t-1,t}. We have not added orientation noise to the motion model. When the boundaries of the environment are known, any probability of being outside the grid can be accumulated in the nearest edge cell. In this way, continual motion in the same direction will eventually lead to near certainty of being somewhere along the corresponding edge.

2.3 Representing the Robot in the Environment. The most convenient and natural way of representing the environment, given the nature of our localization method, is in the form of a grid, where a value P(g_i) is associated with each cell representing the probability of the robot being in that cell. This is reminiscent of Moravec's (1988) certainty grid but with a key difference: the typical certainty grid lends itself well to constructing a map of the environment and has been used effectively for that purpose (e.g., see Thrun et al., in press). In our approach the sensor readings are used immediately to update the estimate of the robot's position; using other representations, there needs to be a way of making combined use of the map and the sensor readings to compute the location of the robot within that map (e.g., by modeling the inverse sonar process or optimizing a template match). In our approach the structure of the environment remains implicit in the models that assign probability distributions to sonar readings at each grid location. The integration over time of our knowledge of the environment is handled implicitly in the learning, whereas the integration of motion over time is handled explicitly by moving the probability distribution.

2.4 Learning to Predict Sonar Readings from Location. We considered several different types of networks for estimating the likelihoods of the sonar readings. All of these nets had two input units to represent location coordinates, and the output units were divided into 12 groups, with each
group specifying the parameters of a probability distribution for the readings from one sonar sensor. When the true location of the robot was known, the net was trained to maximize the log probability density of the actual sonar readings under the distribution represented by the output values of the net when the inputs were the true location.

Initially we used a single hidden layer of logistic units and modeled the probability distribution for each sonar sensor as a mixture of a gaussian and a uniform distribution between 0 and the maximum value of 4 m. This required three output units per sonar sensor: a logistic unit for the mixing proportion of the uniform distribution, a linear unit for the mean of the gaussian, and an exponential unit for its variance. The use of an exponential is appropriate for a scale parameter like the variance, and it prevents the variance from ever becoming negative or zero. In our initial model, the predicted mean of the gaussian could be a nonlinear function of the two input coordinates because we used logistic hidden units.

Later we found that it was better to use a different type of nonlinearity to predict the mean of the gaussian from the input coordinates. The mean is typically a linear function of the input coordinates over a small region but then switches to a different linear function when the sonar return comes from a different planar surface (see Fig. 1), so it is natural to use a mixture of linear models for predicting the mean. We used 12 separate networks, one for each sonar sensor; each such network, n, implemented a mixture of local models. Each local model was itself a mixture of a gaussian and a uniform. The mixing proportions of the local models were a function of the current input coordinates and were specified by the activity levels of a group of normalized radial basis units. The activity level of each basis unit, j, was determined by the relative proximity of its center, c_{nj}, to the two input coordinates, x:

\psi_{nj}(x) = \frac{\exp\left[ -\|x - c_{nj}\|^2 / 2\sigma^2_{RBF_n} \right]}{\sum_k \exp\left[ -\|x - c_{nk}\|^2 / 2\sigma^2_{RBF_n} \right]}.   (2.4)
Both the centers of the units and their shared variance σ²_{RBF_n} were learned.[3] Each basis unit has three associated output units that represent the parameters of a predicted probability distribution for one sonar reading. This distribution is a mixture of a gaussian and a uniform. The basis unit has one adaptive outgoing weight that determines the mixing proportion, z_j, of the gaussian, one adaptive weight for the log of the variance, and three more outgoing weights, a, b, that define a linear model, which predicts the mean of the gaussian from the input coordinates.
[3] The planar walls of the room cause the piecewise linear regions in Figure 1 to have linear boundaries. This makes it appropriate for the radial basis functions to have the same variances because this ensures that, after normalization, the region dominated by an RBF has linear boundaries.
Figure 1: Readings from sensor 2. East-west and north-south positions are plotted along the x- and y-axes, respectively. Signatures were taken at every 2.5 cm inside a 2 m × 1 m rectangular region in the middle of a cluttered room. The measurements obtained by sensor 2 are plotted on the vertical axis. The measurements that drop below the 0 plane indicate locations where the sonar pulse did not return a detectable echo. Many of the vertical surfaces in the room are (nearly) orthogonal to the x- or y-axes, but sensor 2 is not parallel to either axis of the room. So sensor 2 tends to produce signals that have more multiple reflections and therefore a greater number of smaller discontinuous pieces.
A picture of the network for a single sensor is shown in Figure 2. The likelihood distribution produced by basis unit j is

P_{nj}(r_n \mid x) = z_{nj} \frac{1}{\sqrt{2\pi\sigma_{nj}^2}} \exp\left[ \frac{-\|r_n - a_{nj} \cdot x - b_{nj}\|^2}{2\sigma_{nj}^2} \right] + \frac{1 - z_{nj}}{r_{max}}.   (2.5)
We have allowed each region an adaptive uniform noise mixing component because it is possible that the noisy, or unpredictable, readings are a function not only of the sensor but of the location of the robot as well. The normalized activity levels of the basis units are treated as mixing proportions for combining their likelihood distributions into a single likelihood
distribution for each sensor n:

P_n(r_n \mid x) = \sum_j P_{nj}(r_n \mid x)\, \psi_{nj}(x).   (2.6)

Figure 2: Final network model for one sonar sensor. The upper network shows the normalized radial basis function units. A single variance σ²_{RBF} is inferred for all regions of each sensor model. For each of the M regions, a linear unit is used to predict the mean of the gaussian, L_i(x), exponential units are used to learn variances, and sigmoid units are used to learn mixing proportions for uniform noise.
The number of normalized radial basis functions (RBFs) needed to reach a certain positional accuracy is related to the size and complexity of the environment. When a local model is responsible for only a single data point, there is a danger that it will assign an extremely high probability density to the observed reading by collapsing the width of the gaussian component of its output distribution toward zero. This form of overfitting is avoided by tying together the widths of the output gaussians for all the local models for one sonar sensor. Generally, the more RBFs used, the better the performance tended to be, at the expense of learning time. We investigated some methods for pruning and adding RBF units during learning, with encouraging results, though these are beyond the scope of the current discussion (e.g., this could
be very useful in an environment that changes drastically over time). When x is known, all parameters (a, b, and σ, as well as those associated with the RBF) can be learned by maximizing log P_n(r_n | x) (i.e., by gradient descent), but when the robot's true location is unknown and all that is available is the robot's current probability distribution over cells, it is less obvious what should be optimized. Our unsupervised algorithm, which is closely related to expectation maximization, maximizes the expected log likelihood of the sonar signatures, where the expectation is taken with respect to the posterior probabilities of being at each grid cell:

\langle \log P(r \mid x) \rangle_{P(x)} = \int_x \log P(r \mid x)\, P(x)\, dx   (2.7)
\approx \sum_i \log P(r \mid g_i)\, P_{posterior}(g_i),   (2.8)
where P_posterior(g_i) was calculated in equation 2.1. We used the conjugate gradient method to tune the parameters.
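In code, the objective of equation 2.8 for a single signature is just a posterior-weighted sum (hypothetical names; the per-cell log likelihoods would sum log P_n over the 12 sensors):

```python
import numpy as np

def expected_log_likelihood(log_lik_cells, P_posterior):
    """Equation 2.8: sum_i log P(r | g_i) * P_posterior(g_i).
    Gradients of this scalar with respect to the network parameters drive
    the conjugate gradient updates; the posterior weights are held fixed."""
    return float((log_lik_cells * P_posterior).sum())
```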
3 Experimental Results

3.1 The Simulated Environment and Noise. The environment and sonar data used in this work were generated synthetically by a sophisticated simulator provided by Wilkes et al. (1990). This simulator used a ray-tracing algorithm to capture many of the reproducible forms of nongaussian noise that can occur with sonar sensors, such as phantom holes, phantom walls, and occasional failures to detect nearby objects. Furthermore, to account for the noise that is irreproducible, we have made the worst-case assumption by replacing a certain fraction of the readings by random values chosen from a uniform distribution in the interval (0, r_max). This replacement was done by assigning each sensor i an "unreliability" quotient q_i, so that any reading taken by sensor i would have a probability q_i of being replaced by a random number. The unreliability quotients were {0.1, 0.5, 0.001, 0.01, 0.1, 0.001, 0.1, 0.2, 0.001, 0.001, 0.1, 0.05}. In addition, to make the task more difficult, all sensor readings that attained the maximum value r_max were also replaced by a random number. Even for sensor readings considered "reliable," gaussian noise was added (with standard deviations of {4, 4, 3, 3, 5, 5, 5, 5, 2, 1, 2, 10} cm). Note that the net does not attempt to model the relationship between the value of a sonar reading and the distance to the nearest object, only the relationship between the value of the sonar reading and the position of the robot. This means that any reproducible effects can potentially be utilized for better localization, including the "nearby objects appearing far away" effect, as long as there is some degree of consistency some of the time. The fraction of the time when there is no consistency can simply be modeled by the uniform noise-mixing proportion.
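The irreproducible part of this noise model is easy to reproduce in simulation; here is a sketch using the quotients and standard deviations quoted above (checking saturation before noise is added is one reading of the text):

```python
import numpy as np

Q = np.array([0.1, 0.5, 0.001, 0.01, 0.1, 0.001,
              0.1, 0.2, 0.001, 0.001, 0.1, 0.05])             # unreliability quotients
SD = np.array([4, 4, 3, 3, 5, 5, 5, 5, 2, 1, 2, 10]) / 100.0  # gaussian sd, meters

def corrupt(r, r_max=4.0, rng=None):
    """Worst-case corruption of a 12-reading sonar signature r (in meters)."""
    rng = rng or np.random.default_rng()
    bad = (rng.random(12) < Q) | (r >= r_max)  # unreliable or saturated readings
    r = r + rng.normal(0.0, SD)                # gaussian noise on reliable readings
    r[bad] = rng.uniform(0.0, r_max, bad.sum())
    return r
```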
Figure 3: Top-down view of complete sonar signatures at two points 10 cm apart. The position of the robot is marked by the circle near the middle of the room (the one on the left is just slightly further north than the one on the right). The thicker lines are meant to be reflective surfaces: walls, dividers, chair legs, feet, a soda can under a desk, and a door. Each of the thin lines centered at the robot represents the distance that was measured by the sensor facing in that direction. For example, using a clock numbering of the sensors, the distance to the east wall was accurately measured in both cases by sensor 3, while some confusion occurred with sensor 9 as to the location of the western wall due to the interference of a chair base on the right. The lines that touch the outer boundaries—the measurements of sensor 2 and sensor 4—represent the directions from which no strong echo was returned.
The environment used in all of the following tests was a cluttered office, about 5 m by 5 m. Figure 3 illustrates the environment and signatures taken from two nearby positions. Note the difference in the two signatures. These signatures have not yet had any random noise added to them. We believe the clutter in the environment to be a realistic model of working environments, where numerous objects are left all over the place.

3.2 Tracking Algorithm. The unsupervised algorithm relies on the premise that combining the knowledge of the motion with the sonar-based network predictions will produce better position estimates than those that would be obtained by using either source of information alone. To demonstrate that this is typically true, we used supervised training of the network and then compared the errors of the dead-reckoned estimates (motion alone), the network estimates (sonar alone), and the filtered estimates (sonar and motion combined) as the robot travels along a path. The position is actually represented by a probability distribution over a grid, but for these comparisons, we used the mean of the distribution as the estimate.
Figure 4: Absolute errors in the predicted position for a single path. The noise in the motion sensors is multiplicative gaussian with a mean of zero and standard deviation of σ_motion = 5 cm for every 20 cm of motion.
In Figure 4, a network was first trained on a set of 80 labeled examples taken from a boxed region of the position space, and then a random path 30 steps long was traversed without giving position information. The tracking algorithm is successful; the errors of the filtered predictions in Figure 4 are roughly equal to the smaller of the sonar- and motion-based prediction errors. Also shown is the expected value of the absolute error that would be obtained over many runs of a single path using the same noise model for the motion sensors. This smooth curve would continue to increase as the square root of the number of time steps. The variance in the errors of the dead-reckoned predictions would increase as well.

At the early stages of the unsupervised learning, the network will typically make larger errors, but these will be more easily overwhelmed by the motion constraints, since the distributions predicted by the net will have a higher entropy. In Figure 5, a network was used that had been trained on only 20 examples. Although the variance in the sonar sensor predictions was considerably higher, the filtered predictions were still quite good. We also did runs in which we kept the robot's motion model as described above (i.e., zero-mean gaussian) but affected the motion sensings with a systematic 10% overestimation error in addition to the gaussian noise (i.e., biased noise). Still, the filtered errors generally remained below 10 cm (Oore, 1995). A minimally trained network can be quite useful provided the net does not give too much confidence to its incorrect predictions. Occasional, accurate, high-certainty predictions made by the net cause a confident posterior estimate.
Figure 5: Errors in the predicted position for a single path. The motion sensings were affected by a systematic error (in which all traveled distances were overestimated by a factor of 10%) as well as a gaussian noise process with a larger variance, σ_motion = 10 cm for every 20 cm traveled. The net was trained on 20 examples.
The motion model will then maintain this accuracy for a number of time steps, even if the subsequent predictions by the network are poor.

3.3 Unsupervised Learning. The motivation for an unsupervised method for robot learning begins with the dilemma that dead reckoning alone is insufficient but manual recalibration is too expensive to be practical. The aim of the unsupervised learning procedure is to discover a good generative model of the relationship between location and sonar readings. If the true location is known, this can be done by maximizing the likelihood of each observed sonar reading under the probability distribution produced by the model. Since the true location of the robot is unknown, the best we can do is to estimate the probability that it is at each grid location and then weight the learning by these estimated probabilities.

3.3.1 Training Procedure. The general training procedure used for the unsupervised learning consisted of a few steps:

0. Initialization. The 12 networks for predicting the likelihood distribution of a sonar reading from the coordinates of a grid point were initialized with small random weights, except that the normalized RBF centers were distributed uniformly over the environment. The robot was placed in the room without any position information (i.e., a uniform distribution over the grid).
1. Data collection. The robot traversed a short path, consisting of a sequence of steps. An obstacle-avoidance mechanism was assumed in generating the path. A sonar signature was taken at each step, and the filtering algorithm given by equation 2.1 was used to update the probability distribution over grid cells by integrating the motion and sonar sensings. The updated distribution over cells was combined with the current sonar signature to yield weighted training examples of grid cell–sonar signature pairs. The examples were accumulated over all the steps in the path.

2. Training. Using the data just collected, the network was trained for a fixed number of iterations, after which the training set was discarded. The last filtered position distribution from the previous path was retained as the initial distribution when returning to step 1 to acquire more data. The weight vector of the trained net was retained as the starting point for the next minimization. Although the paths were random, they were generated to drift slowly from left to right and back again (and again) while drifting up and down more quickly.
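In outline, the procedure just described alternates the filtering of section 2.2 with weighted likelihood maximization; here is a structural sketch (the helper callables are placeholders for the machinery described above, not part of the article):

```python
import numpy as np

def unsupervised_training(n_paths, steps_per_path, grid_shape,
                          take_step, filter_update, train_nets):
    """Steps 0-2 of the training procedure, in outline.

    take_step() -> (move, signature); filter_update(P, move, r) -> new P
    (equations 2.1-2.3); train_nets(examples) runs ~30 conjugate gradient
    iterations on the posterior-weighted examples (equation 2.8)."""
    P = np.full(grid_shape, 1.0 / np.prod(grid_shape))  # step 0: uniform prior
    for _ in range(n_paths):
        examples = []                                   # step 1: data collection
        for _ in range(steps_per_path):
            move, r = take_step()
            P = filter_update(P, move, r)
            examples.append((P.copy(), r))              # weighted training pairs
        train_nets(examples)                            # step 2: then discard data
        # P carries over as the initial distribution for the next path.
```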
3.3.2 Performance of the Unsupervised Learning Algorithm. The cluttered office environment was used, with a motion noise of σ_motion = 2.5 cm for every 20 cm traveled. The network was trained for 30 conjugate gradient iterations after every 30 steps. Figure 6 shows the actual paths traveled and the paths as predicted from the sonar readings alone by the network, after 3 and then after 35 paths. The first peculiarity to observe is that although the filtered estimates correctly capture the shape of the true path, they are a translation of it. This can be explained by noting that the motion sensings provide relative position information, while the only indications of absolute position are an initial distribution, if given, and the edges of the grid if they are reached by the exploration. Thus, since the paths we used systematically continued to touch the east-west walls but did not extend fully in the north-south direction, it is not surprising that the robot constructed a model that was self-consistent (that is, consistent with the motion constraints) but was a vertical translation of the actual environment. Examining the sonar-based errors, we can see that in the earlier predictions, the network seemed to get the right vicinity (modulo a translation) but made gross errors. After continuing to train over another 32 paths, the network learned a model that was a more accurate translation of the true environment. In both cases, the majority of the larger errors in the sonar predictions were corrected by the filtering algorithm.
Figure 6: Real vs. predicted paths. The axes of the graphs can be seen as representing a top-down view of a portion of the environment, in which the robot took a zigzag path going from west to east. For visual clarity, the paths have not been adjusted to match up. Inspection will reveal that the sonar sensings in the lower left-hand region of the earlier path are very loosely related to the shape of the true path. Nevertheless, it is immediately clear that they correspond to the right general region. In contrast to this, looking at the right-hand side of the later path shows that the sonar-based estimates exhibit a much closer match to the filtered and true positions.
Figure 7: Learning while exploring. The means were calculated over sets of paths, and the standard deviations were estimated from the sample, assuming gaussian noise (which is incorrect for values that cannot be negative).
Figure 7 shows the filtered and sonar errors with error bars for paths as the learning progresses, after performing the appropriate translation.[4] We see that the filtered predictions were almost always better and had significantly smaller error bars than the sonar predictions. At around the fiftieth path, the variance in the sonar errors suddenly increased, for at this point the environment was rearranged (chairs were added, moved, etc.). The average error did not increase by much because it was only in certain parts of the room that the network's predictions were affected. This allowed the motion constraints to maintain accurate localization for a few time steps, and then, before the filtered predictions started to deviate by much, a few confident and accurate sonar predictions in a familiar part of the room were all that were needed to recalibrate the tracking effectively. Since the normalized RBFs respond locally, they can quickly and easily adapt to local changes in the room. Although the network acted desirably in this case, problems can arise if too many consecutive, overconfident, but inaccurate sonar predictions are made. The threshold that determines overconfidence is, in part, a function of the noise model in the motion sensings. A possible danger is that if a certain region is avoided during initial training because of an obstacle, then when that obstacle is later removed, there may still be a hole in the posterior where the object used to be.
this will be prevented to a certain extent for two reasons. First, during the initial period of training, if the distribution of the robot's position is very entropic, the parameters for the distributions will be learned so that the entire position space is covered, and therefore those places that are not contained in the posterior distributions at later stages will still have been reasonably initialized early on. Second, for many of the sensor readings (in particular, those that are not pointing at the obstacle in question), the readings on either side of the obstacle are quite likely to belong to the same linear region, and therefore the corresponding linear predictors are already tuned to interpolate correctly through the occupied region.

Various modifications in the training procedure can significantly improve the tracking and learning. For example, the learning can be accelerated by starting with a net trained on a small number of labeled examples rather than using a purely random initial net. Alternatively, the initial net can be untrained, but accurate position information can be given at the beginning of each of the first few paths. If the robot is returned to the same part of the room to start each path, it learns to make confident and accurate predictions in that region, and with further exploration it gradually expands the region in which its model is accurate. The exploration strategy and length of paths between minimizations are also important. If too many of the training points in one path come from the same part of the room, then the centers of the RBF units need to be frozen to prevent an uneven allocation of the network's resources. Although some of these variations speeded up the learning, they were not crucial and were not used in the runs we have described.

4 Relationships to Other Approaches

4.1 Hidden Markov Models. The cells in the grid correspond to the states in a hidden Markov model, and the uncertainty in the movement information corresponds to the matrix of transition probabilities between states. The fact that this matrix changes at each time step is unusual but not incompatible with the standard hidden Markov model framework. Instead of having a separate output model at each state, all of the states share the same 12 neural networks, but the likelihood distributions produced by these nets are contingent on the real-valued location coordinates associated with a state. These coordinates change in a deterministic way as the robot moves, but this does not create any additional problems. In estimating the parameters of the output models, we make one major simplification relative to the Baum-Welch method that is normally used for hidden Markov models (Baum & Eagon, 1967). The correct way to compute the posterior probability distribution over cells given a sequence of vectors of sonar readings is to use the forward-backward algorithm. Repeated application of equations 2.1, 2.2, and 2.3 corresponds to the forward pass, but we ignore the backward pass and so cannot make use of the information that sonar readings at time t provide about the true location of the robot at
times earlier than t. The advantage of our suboptimal method of estimating posterior probabilities is that it can be used online. Also, Neal and Hinton (1993) show that the convergence of expectation maximization does not require the correct posterior distribution to be used. All that is required is that the E-step produce a distribution that has a smaller Kullback-Leibler distance from the true posterior than the previous estimate.5 This is no longer guaranteed when the backward pass is ignored, but it is unlikely that a forward pass using the current parameters in the output models will give a worse estimate of the true posterior given those parameters than a forward pass using the previous parameters.

4.2 Kohonen-Style Self-Organizing Maps. Kröse and Eecen (1994) show that it is possible to compute the location of a robot by first building a topographic map from the vectors of sonar readings alone. The use of a self-organizing map (SOM) partially overcomes the need for labeled training data, but unfortunately, the function that is optimized is not appropriate for the location task. If we view the SOM in generative terms, it attempts to minimize the squared reconstruction error when the input vector is reconstructed from the winning hidden unit or from one of its neighbors. This would be the right thing to minimize if the constraint on sonar readings was that neighboring locations produced similar readings. But this is not true when the robot passes close to the edge of a surface. Provided we know that the movement is small, a much more realistic constraint is that temporally adjacent readings come from neighboring locations. This is precisely the constraint captured by our approach if we treat the movement as random gaussian noise. Obviously, if we know more about the precise movement, we can make this constraint stronger. The relationship between our method and the SOM can be understood by viewing them as two quite different elaborations of a simple mixture model in which each hidden unit has a vector of 12 parameters that represent expected values for the 12 sonar readings. The SOM learning algorithm elaborates this model by forcing neighboring hidden units to have similar parameter vectors. It also uses a hard decision rule for deciding on a winner instead of using posterior probabilities to weight the learning. We elaborate the simple mixture model by using nongaussian output models for each cell in the grid and by tying all these models together to reduce greatly the number of free parameters. Different cells can still predict different probability distributions for the sonar readings because the predictions are conditional on the location coordinates of the cell. Finally, instead of just using the current sonar readings to compute the posterior probability of a cell, we take into account the earlier readings and the motion information, thus allowing the temporal constraints that are ignored by the SOM to influence the learning.
5 Surprisingly, the relevant Kullback-Leibler distance is Σ_α Q_α log(Q_α / P_α), where Q is the estimated probability distribution and P is the true posterior.
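As a concrete reading of the footnote's formula, a minimal sketch (the smoothing constant eps is an added assumption to avoid taking log of zero):

```python
import numpy as np

def kl_distance(q, p, eps=1e-12):
    """Kullback-Leibler distance KL(Q || P) = sum_a Q_a * log(Q_a / P_a)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log((q + eps) / (p + eps))))

# The E-step need only move closer to the true posterior p_true:
#   kl_distance(q_new, p_true) < kl_distance(q_old, p_true)
```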
Acknowledgments

This research was funded by the Institute for Robotics and Intelligent Systems and by the Natural Science and Engineering Research Council. Hinton is the Nesbitt-Burns fellow of the Canadian Institute for Advanced Research. We thank Radford Neal, Allan Jepson, and Brendan Frey for helpful comments.

References

Baum, L., & Eagon, J. (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bulletin of the American Mathematical Society, 73.
Dudek, G., & Hinton, G. (1992). Navigating without a map by directly transforming sensory input to location. Unpublished manuscript.
Dudek, G., Jenkin, M., Milios, E., & Wilkes, D. (1993). Reflections on modelling a sonar range sensor (Tech. Rep. No. CIM-92-9). Montreal: McGill Research Centre for Intelligent Machines, McGill University.
Kröse, B., & Eecen, M. (1994). A self-organizing representation of sensor space for mobile robot navigation. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems (pp. 9–14). New York: IEEE.
Mackay, D. (1991). Bayesian modelling and neural networks. Unpublished doctoral dissertation, California Institute of Technology, Pasadena, CA.
Moravec, H. (1988). Sensor fusion in certainty grids for mobile robots. AI Magazine, pp. 61–74.
Neal, R., & Hinton, G. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript.
Oore, S. (1995). A mobile robot that learns to estimate its position from a stream of sonar measurements. Unpublished master's thesis, University of Toronto.
Thrun, S., Buecken, A., Burgard, W., Fox, D., Froehlinghaus, D., Henning, D., Hofmann, T., Krell, M., & Schmidt, T. (In press). Map learning and high-speed navigation in RHINO. In D. Kortenkamp, R. Bonasso, & R. Murphy (Eds.), Case Studies of Successful Robot Systems. Cambridge, MA: MIT Press.
Wilkes, D., Dudek, G., Jenkin, M., & Milios, E. (1990). A ray following model of sonar range sensing. In Proc. Mobile Robots V (pp. 536–542). Bellingham, WA: International Society for Optical Engineering.
Received December 5, 1995; accepted August 2, 1996.
REVIEW
Communicated by Terrence J. Sejnowski
Similarity, Connectionism, and the Problem of Representation in Vision

Shimon Edelman
Sharon Duvdevani-Bar
Department of Applied Mathematics and Computer Science, Weizmann Institute of Science, Rehovot 76100, Israel
A representational scheme under which the ranking between represented similarities is isomorphic to the ranking between the corresponding shape similarities can support perfectly correct shape classification because it preserves the clustering of shapes according to the natural kinds prevailing in the external world. This article discusses the computational requirements of representation that preserves similarity ranks and points out the relative straightforwardness of its connectionist implementation.

1 Introduction

1.1 Two Problems of Representation. It is possible to distinguish between two problems about mental representation (Cummins, 1989). The first of these, the problem of representations (plural), is basically empirical; two instances of this problem are the search for the representations employed by natural cognitive systems and the design of the representational substrate for artificial cognitive modules. The second problem is what Cummins calls the Problem of Representation (singular). Here, the central question is how, in principle, an internal state of a system can refer to anything at all in the external world. Curiously, the Problem of Representation is rarely discussed in cognitive psychology. On the one hand, arguments put forward by Marr (1982) prompted cognitive scientists to contemplate the fact that not all representations are computationally equivalent; the classical example is the difficulty of doing arithmetic with roman numerals, compared to the ease of calculation using positional notation. On the other hand, it is generally taken for granted that a certain pattern of neuronal activity in someone's brain can represent the shape of a cat or, amazingly, that physically distinct patterns of activity in the brains of two people can stand for the same shape (that of a cat).

1.2 First-Order Isomorphism: Representation by Similarity. Reviews in cognitive psychology tend to adopt extreme positions concerning representation. For example, Palmer (1978) asks, "What is representation?" and
then proceeds with an in-depth survey of the field, the upshot of which is that "we . . . do not really understand our concepts of representation." In contrast, Suppes, Pavel, and Falmagne (1994) state flatly, "Representation of something is an image, model, or reproduction of that thing." This amounts to the Aristotelian idea of representation by similarity; according to it, a representation, say, of a cat, can help you know about cats only if it resembles one. Among the problems raised by this notion of representation is the obligatory involvement of an observer: Cartoon cats resemble real cats to us; the two kinds share no properties other than perceptual. Another problem is the inability of representations of this type to deal with abstraction (does having an image of a striped cat qualify as a representation of a calico cat?). Because of these and other considerations, in the philosophy of mind the idea of representation by similarity has been discredited so thoroughly as to inspire a rare consensus (Cummins, 1989).

1.3 Second-Order Isomorphism: Representation of Similarity. Barring resemblance, or "first-order structural isomorphism" (Shepard, 1968), between the object and the entity that stands for it internally, what relationship can qualify as representation of something by something else? Shepard (1968) suggests "second-order" isomorphism: "The isomorphism should be sought—not in the first-order relation between (a) an individual object, and (b) its corresponding internal representation—but in the second-order relation between (a) the relations among alternative external objects, and (b) the relations among their corresponding internal representations. Thus, although the internal representation for a square need not itself be square, it should (whatever it is) at least have a closer functional relation to the internal representation for a rectangle than to that, say, for a green flash or the taste of a persimmon" (Shepard & Chipman, 1970, p. 2). More recently, the idea of representation by second-order isomorphism has been advanced, under various guises, in a number of fields in cognitive science. Typically the researchers in these fields take for granted the implausibility of representation by similarity, that is, by first-order isomorphism. Consequently the theories mention merely "isomorphism," implying that the isomorphism holds between structures (and is therefore "second order," in Shepard's terms), and not between individual entities. A particularly clear exposition of this point can be found in Gallistel (1990), where the example of numerical representation of physical quantities such as mass is discussed. Gallistel points out that "the statement that the number 10 is isomorphic to some particular mass is meaningless. It is the system of numbers that is isomorphic to the system of masses" (p. 24). Gallistel then proceeds to defend the view that cognitive systems represent space and time by harboring structures that are isomorphic to corresponding structures in the world, subject to the qualification regarding the importance of entire structures (that is, elements plus relations),
as opposed to individual elements. A similar proposal, albeit in the context of representation of conceptual rather than perceptual spaces, can be found in Holland, Holyoak, Nisbett, and Thagard (1986). Here the theory is couched explicitly in terms of a homomorphism between a collection of states and a set of possible transitions between states in the world, and in the internal representation. This version of representation by second-order isomorphism has been discussed very recently in Cummins (1996), where it is, unfortunately, called "The Picture Theory of Representation."1 Importantly, Cummins echoes Gallistel in pointing out that constituents of a structure are meaningful only in the context of some structure. In the context of visual object representation, this observation translates into the notion that only certain relationships between the objects—not the shapes of the individual objects themselves—need to be represented. What could these relationships be? Shepard's example, stated in the preceding section, clearly suggests that the represented structure should be that of similarity relationships between objects. Accordingly, this approach can be termed "representation of similarity" (see Figure 1). Our goal in the rest of this article is to outline some of the mathematical issues raised by this approach to representation and to examine its viability in the context of shape vision.

2 Toward Veridical Representation

2.1 Preliminaries. Let us now identify mathematical conditions under which isomorphism between distal and proximal relations gives rise to representation, which is, in a sense, true to the original. Let 𝒟 be the set of distal objects and 𝒫 the set of internal tokens that participate in the process of representation. Let f: D ⊂ 𝒟 → X and g: P ⊂ 𝒫 → Y be functions defined, respectively, over sets of distal and proximal entities (no restriction needs to be placed at this stage on the number of arguments of f, g). According to the requirement of second-order isomorphism, 𝒟 is isomorphic to 𝒫 under a mapping M if ∀D ⊂ 𝒟, f(D) ∼ g(M(D)), where the relation ∼ is that of set isomorphism, defined over X, Y. To constrain the choice of M, we have to be more specific in defining the functions f, g. In the context of shape perception, it is natural to consider similarity for this purpose (actually, two similarity functions must be defined—one for the external objects and one for the internal tokens standing for those objects).
1 Unfortunately, because in vision research, pictorial representations are traditionally associated with the Aristotelian notion of representation by similarity, or first-order isomorphism.
[Figure 1 sketch: distal shapes G1, G2, C1 are mapped to proximal tokens g1, g2, c1, so that ρd(G1, G2) < ρd(G2, C1) in the distal space is mirrored by ρp(g1, g2) < ρp(g2, c1) in the proximal space.]
Figure 1: Clustering by natural kinds and its representation that fulfills the requirement of second-order isomorphism (here, isomorphism between distance ranks in the two spaces; ρd and ρp are the distal and the proximal distance functions).
Intuitively, we would like the representation to capture the similarity relationships within and between natural kinds (Quine, 1969). More specifically, similarity is seen as relevant to both recognition, in which case resemblance between the viewed shape and some previously seen ones is to be assessed, and classification, where similarities between a shape and a number of shape classes are compared (Nosofsky, 1988; Goldstone, 1994). For a pair of internal representations, similarity is not directly observable but can still be defined operationally as the degree to which an optimal stimulus for one token activates the other one (Shepard & Chipman, 1970). As noted by Nicod (1930) and Quine (1973), similarity should be construed as a triadic relation: Knowing that A is more similar to B than to C is much more informative than merely knowing that A is similar to B and B to C. Furthermore, we may assume that similarity is defined qualitatively and not quantitatively, that is, the range of f, g is X = Y = {>, <}. This assumption limits neither classification (because knowledge of similarity for every triplet of points belonging to a metric space defines unequivocally the clustering of the points) nor recognition (because the location of each point can be recovered from triadic similarities using nonmetric multidimensional scaling; see section 3.1). We thus obtain f: 𝒟 × 𝒟 × 𝒟 → {>, <} and, analogously, g: 𝒫 × 𝒫 × 𝒫 → {>, <}. The condition on M then becomes
∀D1, D2, D3 ∈ 𝒟, f(D1, D2, D3) = g(M(D1), M(D2), M(D3)). In other words, the mapping M is required to preserve similarity ranks.

2.2 Constraints on the Distal Structure. The notion of similarity can be naturally quantified if both the distal shape space and the internal representation space are endowed with metric structure, in which case (dis)similarity in each space can be derived from the appropriate distance function. In the distal space, the existence of a metric for a given shape family (i.e., a subspace of the shape space) depends on the possibility of coming up with at least one parameterization scheme common to all the shapes in the family (the issue of different parameterizations of the same set of shapes is discussed in section 2.3.3). A metric on shapes can then be defined simply as the Euclidean distance in the underlying parameter space R^n. Obviously only a small part of that space is occupied by objects that "look like something"—a combination of parameters picked at random is likely to define a nondescript collection of voxels. It is important to realize, therefore, that the goal behind the introduction of parameters is to formalize the idea of an "objective" similarity for shapes, and not to invent a language in which only some legitimate objects possess valid descriptions. With that qualification, the task of defining an appropriate parameterization becomes quite easy, and can even be automated to a certain extent. The performance of one possible approach to a common parameterization of diverse shapes is illustrated in Figures 2, 3, and 4. According to this approach, the parameterization is computed in a straightforward manner by assigning a dimension to each voxel in a fine subdivision of the bounding box of the object. As a result, it becomes possible to consider various objects as points in a common parameter space and to morph (smoothly deform) one object into another, as shown in Figure 3 (this process corresponds to a movement in the parameter space along the line that connects the points corresponding to the two objects). Although the voxel-based parameter space is high-dimensional, a small number of dimensions may suffice to specify complex and diverse shapes, as illustrated in Figure 4, which examines the utility of principal component analysis on the space of shape parameters. Note that the possibility of a common parameterization of a variety of object shapes provides a substrate for an objective similarity structure of the world, to be reflected internally by the representational process. Thus, it fulfills the first basic prerequisite for representation by second-order isomorphism. The second prerequisite, having to do with the manner whereby the distal similarities are mapped into the representational system, is the topic of the next section.
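The rank-preservation condition of section 2.1 can also be checked empirically, as in the following sketch (the object set and the mapping below are placeholder assumptions; only the triplet test itself follows the definition above):

```python
import numpy as np

def preserves_similarity_ranks(objects, M, n_triplets=1000, seed=0):
    """Test f(D1, D2, D3) = g(M(D1), M(D2), M(D3)) on random triplets,
    with f and g both taken to be Euclidean distance-rank comparisons."""
    rng = np.random.default_rng(seed)
    images = np.array([M(d) for d in objects])
    for _ in range(n_triplets):
        i, j, k = rng.choice(len(objects), size=3, replace=False)
        distal = (np.linalg.norm(objects[i] - objects[j])
                  - np.linalg.norm(objects[j] - objects[k]))
        proximal = (np.linalg.norm(images[i] - images[j])
                    - np.linalg.norm(images[j] - images[k]))
        if np.sign(distal) != np.sign(proximal):
            return False
    return True

# Example: scaling plus a coordinate permutation is a similitude,
# so it preserves distance ranks exactly.
rng = np.random.default_rng(1)
objects = rng.standard_normal((50, 6))
scale_then_permute = lambda d: 3.0 * d[::-1]
assert preserves_similarity_ranks(objects, scale_then_permute)
```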
Figure 2: (Top) Images of several three-dimensional objects. (Middle) The reconstruction of the objects using linear resolution of 10 divisions per side of the bounding box (implying a 10^3-dimensional parameter space, as explained below). (Bottom) Reconstruction under linear resolution of 30 divisions per side of the bounding box (30^3 parameters). Given a linear resolution parameter u, the parameterization program subdivides the bounding box of the object into u × u × u voxels. The vector of occupancy indices for the voxels thus constitutes a geometrical description of the object in terms of u^3 parameters. The resulting parameter space is universal (common to all the objects), as illustrated in Figure 3, where the common parameterization is used to produce morph sequences between different objects.
Figure 3: Successive stages in a morphing (smooth deformation) process linking distinct points in a common shape space. (Top) Morphing a teapot into an apple. (Bottom) Morphing an airplane into a shark. Both examples of morphing used a 30^3-dimensional parameterization of the objects.
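A minimal sketch of this voxel-occupancy parameterization, and of morphing as straight-line motion in the common parameter space (the binary occupancy indices and the point-cloud input format are simplifying assumptions of the sketch):

```python
import numpy as np

def voxel_parameters(points, u=30):
    """Occupancy vector over a u x u x u subdivision of the object's bounding
    box, giving one point in a common u**3-dimensional parameter space."""
    points = np.asarray(points, float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    idx = np.floor((points - lo) / (hi - lo + 1e-9) * u).astype(int)
    occupancy = np.zeros((u, u, u))
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return occupancy.ravel()

def morph(p_a, p_b, t):
    """Move along the line connecting two objects in the common space."""
    return (1.0 - t) * p_a + t * p_b
```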
Figure 4: (Top) Images of several three-dimensional objects. (Middle) Images of the same objects, parameterized with 25^3 parameters and rerendered. (Bottom) Images rendered from representations of the objects in a common five-dimensional parameter space, obtained from the high-dimensional voxel-based space using principal component analysis. The weights of the five leading principal components are:

          knight        al  canstick     shell       ape      pawn
  PC 1     −0.19    −12.01     −2.96     19.36     −4.62      0.43
  PC 2      4.60    −16.19      7.65     −7.52      6.00      5.45
  PC 3      6.82      1.00      1.85     −2.37    −13.46      6.15
  PC 4      5.37     −0.53    −12.87     −1.47      3.79      5.72
  PC 5      6.94     −0.32     −0.08      0.08      0.30     −6.91
2.3 Constraints on the Distal-to-Proximal Mapping.

2.3.1 Distance Rank Preservation. As we saw in the beginning of this section, preservation of the ranks of parameter-space distances among objects by the distal-to-proximal mapping suffices for a faithful representation of their similarity structure. The implications of this requirement for the choice of distal-to-proximal mapping depend on whether ranks need to be preserved fully or partially.

Strict rank preservation. The similarity ranking of three distinct parameter vectors x, y, z ∈ R^n can be formalized by the notion of their simple ratio, defined as ⟨x, y, z⟩ = |x − y|/|y − z|. Obviously a mapping S: R^n → R^n preserves distance ranks iff it preserves ⟨x, y, z⟩ for any choice of points. A bijective mapping S with this property must be a similitude, that is, a mapping of the form S(x) = λP(x), where λ > 0, λ ∈ R, and P: R^n → R^n is an orthogonal transformation (Reshetnyak, 1989). Thus, the requirement of rank preservation is quite restrictive in the class of mappings it allows.
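A quick numerical confirmation that a similitude preserves the simple ratio (obtaining the random orthogonal matrix from a QR decomposition is an assumption of this sketch):

```python
import numpy as np

def simple_ratio(x, y, z):
    """The simple ratio <x, y, z> = |x - y| / |y - z|."""
    return np.linalg.norm(x - y) / np.linalg.norm(y - z)

rng = np.random.default_rng(0)
P, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
S = lambda v: 2.5 * (P @ v)                       # similitude with lambda = 2.5
x, y, z = rng.standard_normal((3, 4))
assert np.isclose(simple_ratio(x, y, z), simple_ratio(S(x), S(y), S(z)))
```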
Local rank preservation. In one or two dimensions, the rank preservation requirement is satisfied locally by any well-behaved mapping. Specifically, a mapping realized by an analytic function with a nonvanishing Jacobian in a given region is conformal there (Cohn, 1967). Such a function preserves similitude of small triangles; in particular, a scalene triangle formed by a triplet of points will be mapped into a triangle with the same ranking of side lengths (see Figure 1). In higher dimensions, conformality is much more restrictive. As proved by Liouville in 1850, already for n = 3 there are no globally conformal mappings from R^n to itself besides those composed of finitely many inversions with respect to spheres. Such mappings, called Möbius transformations, constitute a finite-dimensional Lie group, which includes the group of motions in R^n and is only slightly broader than that group (Reshetnyak, 1989).

Local approximate rank preservation. A considerably broader class of mappings emerges if the requirement of conformality is replaced by that of quasiconformality. Intuitively, a regular topological mapping is quasiconformal if there exists a constant q, 1 ≤ q ≤ ∞, such that almost any infinitesimally small sphere is transformed into an ellipsoid for which the ratio of the largest semiaxis to the smallest one does not exceed q (Reshetnyak, 1989). Under such a mapping, the ranks of distances between points are preserved approximately, on a small scale (Väisälä, 1992, p. 124).

2.3.2 The Implications of Locality. How relevant is local approximate preservation of distance ranks, offered by a quasiconformal distal-to-proximal mapping, to the representation of real-world shapes? An answer to this question is suggested by the realization that distance ranks need not be preserved globally across the entire shape space. The limits to the extent of distance rank preservation expected of a perceptual system are suggested by the hierarchical structure of perceived categories, revealed in numerous studies in cognitive science. Within this hierarchy, there exists a certain level, called basic level, which is the most salient according to a variety of criteria (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). Taking as an example the hierarchy quadruped, dog, terrier, the basic level is that of dog. Objects whose recognition implies less detailed distinctions than those required for basic-level categorization are said to belong to a superordinate level. A reasonable assumption is that faithful representation of similarities is required within superordinate categories (e.g., within quadruped: giraffe to camel, leopard, horse) and, of course, within basic categories, but not between superordinate categories. In other words, the informativeness of the statement that a giraffe is more similar to a banana than to a beetle is questionable. At the same time, one expects a giraffe to be represented as more similar to other members of the quadruped category than to either a banana or a beetle.2

2 Similarity across superordinate category borders may make sense if the comparison concentrates on a few isolated dimensions, which differ from one case to another. For example, giraffes may be found similar to "tall" things (a palm tree, a lamp post) or to "spotted" things (a leopard, a camouflage net). Typically, similarity in those cases is markedly intransitive (a palm tree is not similar to a leopard). See Tversky (1977).
2.3.3 The Components of the Distal-to-Proximal Mapping M. A mapping implemented by a perceptual system can be described generically as M = dimens ◦ measure ◦ imag ◦ geom, where the first two components are dictated by the properties of the world and the other two constitute part of the system:

Geometry. geom(p): The shape of the object, as a function of the distal parameter-space description p.
Imaging. imag(p; z): The image on the receptor surface, as a function of the shape parameters p and the viewing conditions z.
Measurements. measure(p; z): The set of internal measurements performed on the image.
Dimensionality reduction. dimens(p): The low-dimensional representation, independent of z.

Let us now examine the constraints imposed on the mapping M by the requirement of second-order isomorphism between similarity patterns in its domain and its range. Note that some of the components of M are not under the control of the perceptual system while others are. The components determined by the system's environment are likely to have a detrimental effect on the proper representation of similarity, for example, by introducing extraneous sources of variability into the representations. Each of the components is therefore worth considering separately, to determine whether the controllable components can compensate for the effects of the uncontrollable ones.

Consider first the effect of the geometry mapping, geom. Although a perceptual system has no control over this mapping, no such control is necessary: Parameterizations that are "unnatural" in the sense that they encode the geometry of objects in a nonsmooth manner will be excluded from possible recovery by perceptual systems based on the proposed principle, but there will always be plenty of other, "natural" parameterizations that will be amenable to veridical representation. Let P be the set of all parameterizations related to a given one p0 by some conformal mapping T. The set P is an equivalence class (Väisälä, 1971); moreover, because the composition T ◦ M is conformal if M is, veridical representation of some p ∈ P is equivalent to the representation of any other p ∈ P. Now a conformal mapping M will give rise to a proper (i.e., second-order isomorphic) representation of object clustering under all parameterizations belonging to some class P_x. The nature of that class will depend on the nature of the mapping (which can emphasize some distances among objects at the expense of others, with or without altering the distance ranks).
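The four-stage decomposition can be read as a function pipeline; in the sketch below all four stages are placeholder stubs (identity maps) named after the components defined above, with the viewing conditions z folded into the imaging stub:

```python
def compose(*fs):
    """Right-to-left composition: compose(d, m, i, g)(p) = d(m(i(g(p))))."""
    def composed(p):
        for f in reversed(fs):
            p = f(p)
        return p
    return composed

# Placeholder stages; only `measure` and `dimens` are under the
# perceptual system's control.
geom    = lambda p: p            # distal parameters -> object shape
imag    = lambda shape: shape    # shape (under viewing conditions z) -> image
measure = lambda image: image    # image -> high-dimensional measurements
dimens  = lambda m: m            # measurements -> low-dimensional representation

M = compose(dimens, measure, imag, geom)
```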
A system that is a product of natural selection is expected to have evolved a mapping more suitable for the representation of those aspects of its habitat most important for its survival and behavior. Thus, while veridical representation is possible, it is also possible that two perceptual systems implementing different mappings will have incompatible (or even conflicting) pictures of the world. The next component of M—the viewing mapping, imag—introduces a dependence on variables z that are extraneous to the shape parameters that are to be represented. These variables encode the pose of the object with respect to the observer, to the light sources, and to the other objects in the scene. Their influence must be counteracted by the perceptual system, through the combined action of measurement and dimensionality reduction, dimens ◦ measure, to reduce the likelihood of two nearby parameter-space points (i.e., two similar shapes) being mapped into widely disparate points in the final representation space. Note that absolute invariance with respect to these variables is not necessary; it is required only that shape-space changes influence the measurements more strongly than view-space changes (more on this in section 3.2). Furthermore, not all the dimensions of z have to be treated by the same mechanism: Image-plane translation can be compensated for by a covert shift of attention (Anderson & Van Essen, 1987) or an overt one (such as a saccadic eye movement), variation in apparent size by global scaling using a hard-wired mechanism (Schwartz, 1985), and rotation in depth by learning an appropriate normalizing mapping specific for each object class (Poggio & Edelman, 1990). In addition to the requirement that dimens ◦ measure counteract the effects of the view-related variables, one may note that the preservation of distance ranks implies that any change in the distal parameter space must be reflected in the final low-dimensional representation (otherwise, if some of the original dimensions collapse under the representation, distances between points are likely to be distorted). To ensure that as many as possible of the original dimensions of variation among the distal objects are preserved, it is worthwhile to make as many varied measurements as possible. This makes the measurement space (defined by measure) high-dimensional and necessitates subsequent dimensionality reduction (through the action of dimens; the existence of a low-dimensional parameterization of the distal shape space constitutes another computational justification for dimensionality reduction). In a flexible system, dimensionality reduction would have to involve learning to find informative dimensions, depending on the statistics of the input and (if available) of additional knowledge provided by the environment (see, e.g., Intrator, 1993). To summarize, the message of this section is not mere reiteration of the importance of invariance or dimensionality reduction. Rather, it is the possibility of achieving formally veridical representation, provided that those two problems are addressed with the proper computational goal in mind.
According to the proposed theory of representation, that goal is preservation of parametrically defined similarity by the distal-to-proximal mapping. The next section presents empirical evidence in support of the proposed theory. Because our emphasis is on brevity and variety, we make only brief mention of selected results from psychophysics, computational modeling, and neurophysiology of object representation.

3 Empirical Support

3.1 Psychophysics. In his discussion of cognitive representation, Palmer (1978) notes that "in order to test the two models [first- and second-order isomorphism], one must have access to the mapping function from the outside world" (p. 293). Such access can be realized by combining parametric shape generation with a psychological technique known as multidimensional scaling (MDS) (Kruskal & Wish, 1978; Shepard, 1980). Nonmetric MDS is an algorithm for embedding a set of points in a metric space while preserving as closely as possible the ranking of pairwise distances between the points, specified in advance (in psychophysical applications, this ranking is derived from the subject's response data). A number of recent works have explored the veridicality of parametric shape representation using MDS (Edelman, 1995a; Cutzu & Edelman, 1996; Sugihara, Edelman, & Tanaka, 1996). In each of a series of experiments that involved pairwise similarity judgment and delayed matching to sample, subjects were confronted with several classes of computer-rendered three-dimensional animal-like shapes, arranged in a complex pattern in a common high-dimensional parameter space (see Figure 5, left). Response time and error rate data were combined into a measure of subjective shape similarity, and the resulting proximity matrix was submitted to nonmetric MDS. In the resulting solution, the relative geometrical arrangement of the points corresponding to the different objects invariably reflected the complex low-dimensional structure in parameter space that defined the relationships between the stimuli classes (see Figure 5, middle).
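A sketch of this analysis pipeline using off-the-shelf tools (assuming scikit-learn's nonmetric MDS and SciPy's Procrustes routine; the nine-point configuration and its dissimilarity matrix below are synthetic stand-ins for subject data):

```python
import numpy as np
from scipy.spatial import procrustes
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
true_config = rng.standard_normal((9, 2))   # stand-in parameter-space layout
dissim = np.linalg.norm(true_config[:, None] - true_config[None, :], axis=-1)

# Nonmetric MDS embeds the points so as to preserve, as closely as possible,
# the ranking of the pairwise dissimilarities.
mds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
          random_state=0)
recovered = mds.fit_transform(dissim)

# Procrustes-align (translate, rotate, scale) the MDS solution to the true
# configuration and report the residual misfit, as in Figure 5.
_, _, disparity = procrustes(true_config, recovered)
```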
3.2 Computation.

3.2.1 From Shepard's Law of Generalization to Veridical Representation of Similarities. The results of the psychophysical experiments described above indicate that subjects internalize a veridical replica of the parameter-space configuration formed by the stimuli, which in turn can be recovered from the response data. Interestingly, the same mechanism that has been invoked in the past as a model of simple generalization with respect to a reference stimulus—a classifier whose response falls off exponentially with the distance between its preferred stimulus and the current input (Shepard, 1987; Shepard & Kannappan, 1993)—can also serve as a basis for modeling veridical representation of the shape space in humans, as explained below.

A monotonic falloff of the likelihood of generalization with increasing distance between the test and the reference stimuli has been observed in a wide range of experiments, spanning visual and other modalities. Shepard (1987) surveyed empirical data on generalization gradients and showed that they can be described by a decaying exponential, P(x; x0) = e^{−d(x, x0)/σ}, where P(x; x0) is the probability of generalizing to stimulus x the response previously made to x0, and d(x, x0) is the normalized distance in psychological (representation) space. Shepard (1987) also showed that this exponential generalization law can be derived from first principles and is relatively insensitive to the form of the distance function entering the computation.

3.2.2 A Multiple-Classifier Model. Now consider a bank of k classifiers, each tuned to a particular point in the distal space (its optimal stimulus). Together these classifiers implement a mapping M from the distal shape space to the proximal representation space, R^k. To fulfill the requirements for veridical representation discussed in section 2, the response of each classifier must be approximately constant over the range of different viewing conditions z and must fall off gradually and monotonically with parameter-space distance from its stimulus, as in Shepard's law of generalization.
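In code, Shepard's exponential law and the induced distal-to-proximal mapping of the classifier bank might look as follows (the Euclidean distance and the shared σ are simplifying assumptions of the sketch):

```python
import numpy as np

def shepard_response(x, x0, sigma=1.0):
    """Generalization gradient P(x; x0) = exp(-d(x, x0) / sigma)."""
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(x0)) / sigma)

def proximal_representation(x, references, sigma=1.0):
    """A bank of k tuned classifiers maps a distal stimulus into R^k."""
    return np.array([shepard_response(x, r, sigma) for r in references])
```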
Figure 5: (Left) The parameter-space configuration used for generating the stimuli in one of the experiments described in Cutzu and Edelman (1996). (Middle) The two-dimensional MDS solution for the subject data. Symbols: ◦: true configuration; ×: configuration derived by MDS from the subject data, then Procrustes transformed (that is, translated, rotated, and scaled; see Borg & Lingoes, 1987) to fit the true one. Lines connect corresponding points. The residual (Procrustes) distance between the MDS-derived configuration and the true one was 0.66 (expected random value, estimated by bootstrap from the data: 3.14 ± 0.15, mean and standard deviation; 100 permutations of the point order were used in the bootstrap computation). (Right) The two-dimensional MDS solution for the radial basis function (RBF) ensemble model (see section 3.2); Procrustes distance: 1.11 (expected random value: 3.14±0.17). Simulations showed that the recovery of the low-dimensional structure from image-space distances between the stimuli (as opposed to their representation in the RBF ensemble space) was impossible. See Cutzu and Edelman (1996) for details.
Such a system realizes a map M that is smooth and regular and can therefore serve as a substrate for veridical representation of the original parameter space (any diffeomorphism restricted to a compact subset of its domain is quasiconformal; Zorich, 1992, p. 133). A model of this type fully replicated the psychophysical results mentioned in the preceding section. Specifically, the MDS-derived configurations obtained from the performance data of the model showed significant resemblance to the true parameter-space configurations (see Figure 5, right). The model consisted of a number of classifiers (three or four, depending on the experiment), each incorporating a nearly viewpoint-invariant representation of a reference object, formed by training a radial basis function (RBF) network to interpolate a characteristic function for that object (Poggio & Edelman, 1990). Each RBF network was trained to output a constant value for a selection of views of the reference object; the views were encoded by the activities of an underlying receptive field layer. As a result, the classifier responded nearly equally to different views of the reference object it had been trained on and progressively less to objects that differed from the reference one in the shape space (see Figure 6). At the RBF ensemble level, the similarity between two stimuli was defined as the cosine of the angle between the vectors of outputs they evoked in the RBF modules.

3.2.3 Extension to a Diverse Collection of Objects. The faithful representation of interobject similarities acquires a somewhat different meaning if objects that do not belong to a homogeneous class are considered. In that case, it is not reasonable to expect that a system reflect any "proper" pattern of similarities inherent in the stimuli, simply because no such pattern exists for a collection of objects that is diverse enough (cf. section 2.3.2). Note, however, that the requirement of discounting the effects of the extraneous variables (associated with viewpoint, illumination, etc.) still stands. According to this requirement, a representational system must place views of the same object closer together in its internal shape space than views of different objects. Figure 7 illustrates the ability of the multiple-classifier model to do just that. As a control, we first employed MDS to arrange four views of each of three objects (a human figure, a chess piece, and a candlestick) in two dimensions, according to their distances in the space spanned by a collection of receptive fields that covered the images. The left panel of Figure 7 shows that views in the resulting configuration were arranged by orientation of the object, not by object identity. As in the model of the psychophysical findings already described, the correct clustering (by object identity) was obtained when the distances submitted to MDS had been computed in the space of activities of three RBF classifiers, each trained on other views of one of the objects (see Figure 7, right).
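A minimal sketch of one such module and of the ensemble-level similarity (the gaussian basis functions, the ridge term, and plain least-squares interpolation are assumptions of the sketch, not a reconstruction of the authors' training procedure):

```python
import numpy as np

def gaussian_rbf(x, c, width):
    return np.exp(-np.sum((np.asarray(x) - c) ** 2) / (2.0 * width ** 2))

class RBFModule:
    """One classifier: interpolates a characteristic function (output 1.0)
    over encoded training views of a single reference object."""
    def __init__(self, views, width=1.0):
        self.centers = np.asarray(views, float)
        self.width = width
        G = np.array([[gaussian_rbf(v, c, width) for c in self.centers]
                      for v in self.centers])
        # Solve for weights so the module outputs 1.0 on each training view;
        # the small ridge term keeps the system well conditioned.
        self.w = np.linalg.solve(G + 1e-6 * np.eye(len(G)), np.ones(len(G)))

    def __call__(self, x):
        acts = np.array([gaussian_rbf(x, c, self.width) for c in self.centers])
        return float(self.w @ acts)

def ensemble_similarity(a, b):
    """Cosine of the angle between two ensemble response vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```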
Figure 6: The response of a radial basis function module, trained on 10 random views of a parametrically defined object, to stimuli differing from a reference view of that object (marked by the large circle), in three ways: (1) by progressive view change, marked by ◦; (2) by progressive shape change, marked by ×; (3) by combined shape and view change, marked by ∗. The points along each curve have been sorted by pixel-space distance between the test and the reference stimuli (shown along the abscissa). Points are means over 10 repetitions with different random view-space and shape-space directions of change; a typical error bar (± standard error of the mean) is shown for each curve. Note the insensitivity of the module's output to view-space changes, relative to shape-space changes.
3.2.4 More on the Multiple-Classifier Approach. In addition to its ability to replicate human performance in the perception of shape similarities, the multiple-classifier model described above possesses a number of other useful properties (Edelman, 1995b; Edelman, Cutzu, & Duvdevani-Bar, 1996). First, it is related to a powerful method of dimensionality reduction (Bourgain, 1985; Linial, London, & Rabinovich, 1994), in which points belonging to a multidimensional space are embedded into a space of much lower dimensionality, while preserving to a large extent the original interpoint distances. The importance of this relationship stems from the need for drastic dimensionality reduction that arises in biological visual systems, where the dimensionality of intermediate representation spaces (at the level of the optic nerve, for example) may be counted in the hundreds of thousands.
Figure 7: (Left) The arrangement in two dimensions of different views of three objects, derived by MDS from distances computed in the space of activities of receptive fields covering the images (recall that the MDS procedure places views that are similar under the given metric near each other). Two hundred receptive fields, intended to serve as a crude model of the primate primary visual area, were used in this simulation. Note that the views are clustered by the orientation of the object. (Right) The arrangement of the same views according to their distances in the space spanned by the outputs of three RBF classifiers, each trained on one of the objects (see sections 3.2 and 3.2.3). Note the clustering by object identity.
Second, representing a stimulus by the responses of classifiers, each of which is nonlinear (e.g., exponential, as in Shepard's law) in the measurement-space distance between the stimulus and some reference object, introduces features polynomially related to the original measurement variables.3 This may simplify subsequent classification by nonlinearly enriching the representation space and making the possibility of linear separation between classes more likely (Cortes & Vapnik, 1995). Third, responses of a number of classifiers can serve as a substrate for carrying out classification at superordinate, basic, or subordinate levels of categorization, depending on the manner in which these responses are processed. For example, summing the responses of classifiers tuned to members of a certain class gives a measure of the affinity of the stimulus to that class (Nosofsky, 1988), while considering the responses separately can allow fine individual-level distinction between stimuli. Fourth, if the saliency of individual classifiers in distinguishing between various stimuli is kept track of and is taken into consideration depending on the task at hand, then similarity between stimuli in the representation space can be made asymmetrical and nontransitive, in accordance with Tversky's general contrast model of similarity (Tversky, 1977).
3 Think of the Taylor expansion of e^{−‖x − x0‖} around x0, where x is the stimulus and x0 is a reference object.
It is interesting to compare the multiple-classifier approach to object representation to other recent proposals that avoid geometric reconstruction of individual objects. A typical example of the latter is representation by view interpolation. Because different views of a three-dimensional object are known to lie on a smooth low-dimensional manifold embedded in the space of all views of all objects (Ullman & Basri, 1991), it is possible to employ standard function approximation techniques such as RBFs to determine whether an image belongs to a given object, the latter being represented merely by a set of its views. Indeed, this technique is incorporated into each classifier module in our multiple-classifier system. Note, however, that interpolation is useless if the object is novel. In fact, an interpolation module cannot be trained to produce the “right output” for novel objects because such a thing is undefined. In contrast, the multiple-classifier system can still provide useful information in such a case (by signaling, e.g., the category affiliation of the stimulus rather than its identity, as suggested in the previous paragraph). In a sense, this is made possible by mimicking, on a higher level, the behavior of systems that carry out multiple feature measurements over the input image and then classify it according to the histograms computed for the different features (Swain & Ballard, 1991; Mel, 1996); in our case, the measurements are similarities to entire objects. We conjecture that a combination of the two approaches with top-down processes capable of anchoring the representation activated by a stimulus in the retinal and the world coordinate frames would lead to the development of more comprehensive theories of vision.
3.3 Neurophysiology. The approach to representation based on a smooth distal-to-proximal mapping and its implementation by the bank of classifiers, outlined in the preceding section, lead to explicit predictions regarding the processing of objects at the higher levels of the primate visual system. Specifically, one expects to find there units responding preferentially to certain objects, with the response falling off monotonically with dissimilarity between the stimulus and the preferred object, while staying nearly constant over different views of the preferred object (cf. Figure 6). Although first reports of cells in the monkey inferotemporal cortex that respond preferentially to faces date back more than two decades (Gross, Rocha-Miranda, & Bender, 1972), cells tuned to general objects have been found only recently. In particular, Tanaka and his group reported the desired selectivity for specific (mostly two-dimensional) objects in recordings from the inferotemporal cortex (Fujita, Tanaka, Ito, & Cheng, 1992; Kobatake & Tanaka, 1994; Tanaka, 1992, 1993). More recently, Logothetis, Pauls, and Poggio (1995) reported recordings from cells tuned to specific views of three-dimensional objects on which the monkey had been trained. A small proportion of the object-tuned cells they found responded each to a limited subset of the objects, irrespective of view.
None of the above experiments involved parametric manipulation of the stimulus shape—a crucial component in testing the predictions of the theory of representation proposed here. In another study, where such manipulation was attempted, the stimuli were complex, parametrically defined, periodic two-dimensional patterns (Sakai, Naya, & Miyashita, 1994). In that study, the response of cells was found to decrease monotonically with parameter-space distance between the test stimulus and the preferred pattern to which the cells were tuned. With parametrically controlled three-dimensional stimuli, it should be possible to look for cells that behave similarly to the RBF module whose behavior is illustrated in Figure 6. Such a cell would respond equally to different views of its preferred object, but its response would decrease with parameter-space distance from the point corresponding to the shape of the preferred object. Furthermore, the responses of a number of such cells, each tuned to a different reference object, should carry sufficient information for classifying novel stimuli of the same general category. Finally, if the pattern of stimuli has a simple low-dimensional characterization in some underlying parameter space (as in Figure 5, left), it should be recoverable from the ensemble response of a number of cells, using multidimensional scaling.
4 Conclusions

A lawful correspondence between proximities of objects in an internal representation space and their distal similarities offers a principled solution to the Problem of Representation by showing that the internal space can mirror the visual world in a mathematically precise and behaviorally significant sense. This view of representation, derived from R. N. Shepard's (1968) idea of second-order isomorphism between the representation space and the world, possesses a number of advantages over the simplistic alternative ("pictures in the head"). It also constitutes a plausible model of representation in primate vision, as indicated by a series of recent psychophysical results. We have seen that faithful representation by second-order isomorphism can be supported by a system of classifiers whose graded-profile overlapping "receptive fields" are tuned to specific object classes in a common parametrically defined shape space. In sensory physiology, the notion of functional units selectively and reliably responsive to external stimuli (units whose output covaries with the sensory input in a well-defined and predictable manner) has found wide acceptance in the form of the feature detector doctrine (Barlow, 1979). A better understanding of the formal capabilities of a system of feature detectors in the context of shape vision and of its performance as a psychological model should help this doctrine evolve into a full-fledged theory of representation.
Acknowledgments

Thanks to A. Aertsen, G. Cottrell, F. Cutzu, N. Intrator, D. Mumford, T. Poggio, R. Shepard, and S. Ullman for useful discussions and suggestions and to the two anonymous reviewers for constructive criticism. The experiments described in Figure 5 were carried out by F. Cutzu. K. Grill-Spector helped with the development of the shape-morphing procedures. S.E. holds the Sir Charles Clore Career Development Chair at the Weizmann Institute of Science.

References

Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Science, 84, 6297–6301.
Barlow, H. B. (1979). The past, present and future of feature detectors. In D. Albrecht (Ed.), Recognition of Pattern and Form (pp. 4–32). Berlin: Springer.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. Berlin: Springer-Verlag.
Bourgain, J. (1985). On Lipschitz embedding of finite metric spaces in Hilbert space. Israel J. Math., 52, 46–52.
Cohn, H. (1967). Conformal mappings on Riemann surfaces. New York: McGraw-Hill.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
Cummins, R. (1989). Meaning and mental representation. Cambridge, MA: MIT Press.
Cummins, R. (1996). Representations, targets, and attitudes. Cambridge, MA: MIT Press.
Cutzu, F., & Edelman, S. (1996). Faithful representation of similarities among three-dimensional shapes in human vision. Proc. Natl. Acad. Sci., 93, 12046–12050.
Edelman, S. (1995a). Representation of similarity in 3D object discrimination. Neural Computation, 7, 407–422.
Edelman, S. (1995b). Representation, similarity, and the chorus of prototypes. Minds and Machines, 5, 45–68.
Edelman, S., Cutzu, F., & Duvdevani-Bar, S. (1996). Similarity to reference shapes as a basis for shape representation. In G. Cottrell (Ed.), Proceedings of COGSCI'96, San Diego, CA (pp. 260–265).
Fujita, I., Tanaka, K., Ito, M., & Cheng, K. (1992). Columns for visual features of objects in monkey inferotemporal cortex. Nature, 360, 343–346.
Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.
Goldstone, R. L. (1994). The role of similarity in categorization: Providing a groundwork. Cognition, 52, 125–157.
Gross, C. G., Rocha-Miranda, C. E., & Bender, D. B. (1972). Visual properties of cells in inferotemporal cortex of the macaque. J. Neurophysiol., 35, 96–111.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge, MA: MIT Press.
Intrator, N. (1993). Combining exploratory projection pursuit and projection pursuit regression. Neural Computation, 5, 443–455.
Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71, 2269–2280.
Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Beverly Hills, CA: Sage.
Linial, N., London, E., & Rabinovich, Y. (1994). The geometry of graphs and some of its algorithmic applications. FOCS, 35, 577–591.
Logothetis, N. K., Pauls, J., & Poggio, T. (1995). Shape recognition in the inferior temporal cortex of monkeys. Current Biology, 5, 552–563.
Marr, D. (1982). Vision. San Francisco: Freeman.
Mel, B. (1996). SEEMORE: Combining color, shape, and texture histogramming in a neurally-inspired approach to visual object recognition (Tech. Rep.). Los Angeles: University of Southern California.
Nicod, J. (1930). Foundations of geometry and induction. London: Routledge & Kegan Paul.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classification, recognition, and typicality. Journal of Experimental Psychology: Learning, Memory and Cognition, 14, 700–708.
Palmer, S. E. (1978). Fundamental aspects of cognitive representation. In E. Rosch & B. B. Lloyd (Eds.), Cognition and Categorization (pp. 259–303). Hillsdale, NJ: Erlbaum.
Poggio, T., & Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263–266.
Quine, W. V. O. (1969). Natural kinds. In W. V. O. Quine (Ed.), Ontological Relativity and Other Essays (pp. 114–138). New York: Columbia University Press.
Quine, W. V. O. (1973). The roots of reference. La Salle, IL: Open Court.
Reshetnyak, Y. G. (1989). Space mappings with bounded distortion. Providence, RI: American Mathematical Society.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.
Sakai, K., Naya, Y., & Miyashita, Y. (1994). Neuronal tuning and associative mechanisms in form representation. Learning and Memory, 1, 83–105.
Schwartz, E. L. (1985). Local and global functional architecture in primate striate cortex: Outline of a spatial mapping doctrine for perception. In D. Rose & V. G. Dobson (Eds.), Models of the Visual Cortex (pp. 146–157). New York: Wiley.
Shepard, R. N. (1968). Cognitive psychology: A review of the book by U. Neisser. Amer. J. Psychol., 81, 285–289.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science, 210, 390–397.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Shepard, R. N., & Chipman, S. (1970). Second-order isomorphism of internal representations: Shapes of states. Cognitive Psychology, 1, 1–17.
Shepard, R. N., & Kannappan, S. (1993). Connectionist implementation of a theory of generalization. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 665–672). San Mateo, CA: Morgan Kaufmann.
Sugihara, T., Edelman, S., & Tanaka, K. (1996). Representation of objective similarity among 3D shapes in the monkey. Invest. Ophthalm. Vis. Sci. Suppl. (Proc. ARVO).
Suppes, P., Pavel, M., & Falmagne, J. (1994). Representations and models in psychology. Ann. Rev. Psychol., 45, 517–544.
Swain, M. J., & Ballard, D. H. (1991). Color indexing. International Journal of Computer Vision, 7, 11–32.
Tanaka, K. (1992). Inferotemporal cortex and higher visual functions. Current Opinion in Neurobiology, 2, 502–505.
Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science, 262, 685–688.
Tversky, A. (1977). Features of similarity. Psychological Review, 84, 327–352.
Ullman, S., & Basri, R. (1991). Recognition by linear combinations of models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 992–1005.
Väisälä, J. (1971). Lectures on n-dimensional quasiconformal mappings. Berlin: Springer-Verlag.
Väisälä, J. (1992). Domains and maps. In M. Vuorinen (Ed.), Quasiconformal Space Mappings (pp. 119–131). Berlin: Springer-Verlag.
Zorich, V. A. (1992). The global homeomorphism theorem for space quasiconformal mappings. In M. Vuorinen (Ed.), Quasiconformal Space Mappings (pp. 132–148). Berlin: Springer-Verlag.
Received October 6, 1995; accepted June 18, 1996.
ARTICLE
Communicated by David Mumford
Dynamic Model of Visual Recognition Predicts Neural Response Properties in the Visual Cortex

Rajesh P. N. Rao
Dana H. Ballard
Department of Computer Science, University of Rochester, Rochester, NY 14627-0226, USA
The responses of visual cortical neurons during fixation tasks can be significantly modulated by stimuli from beyond the classical receptive field. Modulatory effects in neural responses have also been recently reported in a task where a monkey freely views a natural scene. In this article, we describe a hierarchical network model of visual recognition that explains these experimental observations by using a form of the extended Kalman filter as given by the minimum description length (MDL) principle. The model dynamically combines input-driven bottom-up signals with expectation-driven top-down signals to predict current recognition state. Synaptic weights in the model are adapted in a Hebbian manner according to a learning rule also derived from the MDL principle. The resulting prediction-learning scheme can be viewed as implementing a form of the expectation-maximization (EM) algorithm. The architecture of the model posits an active computational role for the reciprocal connections between adjoining visual cortical areas in determining neural response properties. In particular, the model demonstrates the possible role of feedback from higher cortical areas in mediating neurophysiological effects due to stimuli from beyond the classical receptive field. Simulations of the model are provided that help explain the experimental observations regarding neural responses in both free viewing and fixating conditions.

1 Introduction

Much of the information regarding the response properties of cells in the primate visual cortex has been obtained by displaying visual stimuli within a cell's receptive field while a monkey fixates on a blank screen and observing, via microelectrode recordings, the neuronal responses elicited by the displayed stimuli (Hubel & Wiesel, 1968; Zeki, 1976; Baizer, Robinson, & Dow, 1977; Andersen, Essick, & Siegel, 1985; Maunsell & Newsome, 1987).
Note: A preliminary account of this work appeared as Rao and Ballard (1995b).

Neural Computation 9, 721–763 (1997)
© 1997 Massachusetts Institute of Technology
The initial assumption was that the responses thus recorded are not substantially different from those generated in a more natural setting, but later experiments showed that the responses of many visual cortical neurons can be significantly modulated by stimuli from beyond the classical receptive field (Nelson & Frost, 1978; Zeki, 1983; Allman, Miezin, & McGuinness, 1985; Desimone & Schein, 1987; Guylas, Orban, Duysens, & Maes, 1987; Muller, Singer, Krauskopf, & Lennie, 1996). Modulatory effects in neural responses have also been recently reported by Gallant, Connor, and Van Essen (1994), Gallant, Connor, Drury, and Van Essen (1995), and Gallant (1996) in a task where a monkey freely views a natural scene. In these experiments, responses from the visual areas V1, V2, and V4 were recorded while a monkey freely viewed natural images. The image subregions that fell within a cell's receptive field during free viewing were then extracted from the images and flashed within the cell's receptive field while the monkey fixated on a blank screen. Surprisingly, the responses obtained when the stimuli were flashed during the fixation task were found to be generally much larger than when the same stimuli entered the receptive field during free viewing.

In this article, we describe a hierarchical network model of dynamics and synaptic learning in the visual cortex that provides explanations for these experimental observations by exploiting two fundamental properties of the cortex: (1) the reciprocity of connections between any two connected areas (Rockland & Pandya, 1979; Van Essen, 1985; Felleman & Van Essen, 1991), and (2) the convergence of inputs from lower-area neurons to a higher-area neuron, resulting in larger receptive field sizes for neurons in the higher area (Van Essen & Maunsell, 1983; Desimone & Ungerleider, 1989). The reciprocity of connections allows feedback pathways to convey top-down reconstructed signals that serve as predictions for lower-level modules, while the feedforward pathways convey the residuals between current recognition state and the top-down predictions from the higher level. The larger receptive field size of higher-level neurons allows them to integrate information over a larger spatial extent, which in turn permits them to influence, via feedback pathways, lower-level units operating at a smaller spatial scale.

The model acknowledges the fact that vision is an active, dynamic process subserving the larger cognitive goals of the organism that the visual system is embedded in. In particular, the dynamics of a cortical module at a given hierarchical level is shown, using the minimum description length (MDL) principle (Rissanen 1989; Zemel, 1994), to assume the form of an extended Kalman filter (Kalman, 1960; Kalman & Bucy, 1961; Maybeck, 1979), which optimally estimates current recognition state by combining information from input-driven bottom-up signals and expectation-driven top-down signals. The dynamics also allows a nonlinear prediction step that facilitates the processing of time-varying stimuli and helps to counteract the signal propagation delays introduced by the different hierarchical levels of the network. Prediction is mediated by a set of units whose weights can be adapted in order to minimize the prediction error. The transformation
of signals from lower, less abstract areas to higher, more abstract areas is achieved via synaptic weights in the feedforward pathway, while the transformation back is achieved via the feedback weights, both of which are learned during exposure to natural stimuli. This learning is mediated by a Hebbian learning rule also derived from the MDL principle. The weights are shown to approach the transposes of each other, which allows the fast dynamics of the Kalman filter to be efficiently implemented. The combined prediction-learning scheme can be viewed as implementing a form of the expectation-maximization (EM) algorithm (Baum, Petrie, Soules, & Weiss, 1970; Dempster, Laird, & Rubin, 1977).

We first verify the hypothesis that reciprocal connections between different hierarchical levels can mediate robust visual recognition by assessing the model's performance on a small training set of realistic objects. The model is shown to recognize objects from its training set even in the presence of partial occlusions and to maintain a fine discrimination ability between highly similar visual stimuli. By exposing the network to natural images, we show that the receptive fields developed at the lowest two hierarchical levels are comparable to those of cells at the corresponding levels in the primate visual cortex. We then describe simulations of the model in both free viewing and fixating conditions and show that the simulated neural responses are qualitatively similar to those observed in the neurophysiological experiments of Gallant et al. (1994, 1995) and Gallant (1996). We conclude by describing related work on cortical modeling, Kalman filters, and learning-parameter estimation techniques (including the relationship to the EM algorithm) and discuss the use of the model as a computational substrate for understanding a number of well-known psychophysical and physiological phenomena such as illusory contours, bistable percepts, amodal completions, visual imagery, end stopping, and other effects from beyond the classical receptive field.

2 A Simple Input Encoding Example

We begin by examining a simple input encoding problem. In particular, consider the problem of encoding a collection of inputs $I_1, I_2, \ldots$ using a single-layer recurrent network (see Figure 1) with linear activation units, where the goal is to minimize the reconstruction error. In other words, for a given n × 1 input vector I, we would like to minimize the sum of componentwise squared errors as given by

$J(U, r) = (I - Ur)^T (I - Ur)$,   (2.1)
where U denotes the n × k feedback matrix, the rows of which correspond to the synaptic weights of units in the feedback pathway, r denotes the current k × 1 activation (or internal state) vector, and T denotes the transpose operator. The vector Ur corresponds to the network's reconstruction $I'$ of the input I.
[Figure 1 appears here, showing the input image I, feedforward weights W, state vector r, and feedback weights U.]

Figure 1: A simple input encoder network. The goal is to optimally encode inputs I by finding best estimates $\hat{U}$, $\hat{W}$, and $\hat{r}$ for the network parameters U, W, and r. Optimality is defined in terms of least squared reconstruction error, where the reconstruction of the input is given by $I' = Ur$.
Note that the optimization function J is modulated by two sets of parameters, U and r. We can thus minimize J by performing gradient descent with respect to both of these parameters. For a given value of U, we can obtain the optimal estimate $\hat{r}$ of the state vector r according to

$\dot{\hat{r}} = -\frac{k_1}{2} \frac{\partial J}{\partial \hat{r}} = k_1 U^T (I - U\hat{r})$,   (2.2)
where $k_1 > 0$ determines the adaptation rate for $\hat{r}$. Equation 2.2 can be equivalently expressed in its discrete form as

$\hat{r}(t+1) = \hat{r}(t) + k_1 U^T (I - U\hat{r}(t))$.   (2.3)
Similarly, for a given value of r, we can obtain the optimal estimate $\hat{U}$ of the generative weight matrix U according to

$\dot{\hat{U}} = -\frac{k_2}{2} \frac{\partial J}{\partial U} = k_2 (I - \hat{U}r)\,r^T$,   (2.4)
where $k_2 > 0$ determines the learning rate. Equivalently,

$\hat{U}(t+1) = \hat{U}(t) + k_2 (I - \hat{U}(t)r)\,r^T$.   (2.5)
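As a concrete illustration, here is a minimal NumPy sketch of the alternating scheme in equations 2.3 and 2.5: for each input, the state estimate is relaxed to a stable value, and the generative weights are then updated. The dimensions, learning rates, and random data are illustrative assumptions, not values taken from the article.

```python
import numpy as np

# Minimal sketch of the input encoder of section 2 (equations 2.3 and 2.5).
n, k = 64, 10          # input size, number of state units (illustrative)
k1, k2 = 0.1, 0.01     # adaptation rate for r, learning rate for U (illustrative)
rng = np.random.default_rng(0)
U = 0.1 * rng.standard_normal((n, k))   # feedback (generative) weights

def encode(I, U, n_steps=50):
    """Relax r-hat to minimize ||I - U r||^2 for fixed U (equation 2.3)."""
    r = np.zeros(k)
    for _ in range(n_steps):
        r = r + k1 * U.T @ (I - U @ r)
    return r

for I in (rng.standard_normal(n) for _ in range(100)):
    r = encode(I, U)                     # let r-hat stabilize first
    U = U + k2 * np.outer(I - U @ r, r)  # then update U (equation 2.5)
```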
It is interesting to note that update rule 2.5 can be regarded as a form of Hebbian learning, with presynaptic activity r and postsynaptic activity $(I - \hat{U}r)$. The difference in the latter term can be computed, for instance, via inhibitory feedback. Typically, for a given input I during learning, $r = \hat{r}$ is allowed to converge to a stabilized value before the weights $\hat{U}$ are updated. These new values for $\hat{U}$ are subsequently used for computing the state $\hat{r}$ for the next input. The resulting scheme can thus be regarded as implementing an online form of the well-known EM algorithm (Dempster, Laird, & Rubin, 1977) (see section 9.5 for a discussion). Note that for appropriately small positive values of $k_1$, the quadratic form of J ensures convergence of $\hat{r}$ to the global minimum of J for a given input I.

An important feature of the above estimation scheme is that the dynamics in equation 2.3 can be readily implemented in the network of Figure 1 if we make the feedforward weight matrix $W = U^T$. By using a learning rule for U and $W^T$ identical to equation 2.5 but with a linear weight decay term, it can be shown that $\hat{W}$ converges approximately to $\hat{U}^T$ even when initialized to arbitrary random values (see, for example, Williams, 1985).

A number of variants of the estimation scheme sketched above have previously appeared in the literature in various guises. For example, Daugman (1988) proposes a gradient descent rule for estimating the coefficients of a fixed basis set of nonorthogonal Gabor filters (see also Pece, 1992). Daugman's scheme can be seen to be essentially identical to equation 2.2, where the rows of $U^T$ comprise the Gabor filter set. On the other hand, rather than fixing the basis set of filters to be Gabor functions, one can attempt to learn the basis set for some fixed state vector r. In particular, if we let $r = U^T I$ (the feedforward response assuming $W = U^T$), equation 2.4 reduces to Oja's (1989) subspace network learning algorithm and Williams's (1985) symmetric error-correction learning rule, both of which were shown to generate orthogonal weight vectors that span the principal subspace (or dominant eigenvector subspace) of the input distribution. Thus, the purely feedforward form of the network performs an operation equivalent to principal component analysis (PCA) (Chatfield & Collins, 1980). More recently, learning schemes similar to the one we presented above have been proposed independently by Olshausen and Field (1996) and Harpur and Prager (1996). As above, these schemes combine the dynamics of the state vector r with the learning of the weights U. By adding an activity penalty (derived from a "sparseness function") to the dynamics of the state vector r, Olshausen and Field (1996) demonstrate the development of nonorthogonal and localized basis filters from natural images.

In the following sections, we show that the above framework can be regarded as a special case of a more general estimation and learning scheme that has its roots in stochastic modeling and Kalman filter theory. In particular, using a hierarchical stochastic model of the image generation process, we formulate an optimization function based on the MDL principle. In the MDL formulation, the activity penalty of Olshausen and Field (1996) arises as a direct consequence of the assumed model prior. As we show in
the following sections, the resulting learning-estimation scheme allows the modeling of the visual cortex as a hierarchical Kalman predictor.

3 The Architecture of the Model

The model is a hierarchically organized neural network wherein each intermediate level of the hierarchy receives bottom-up information from the preceding level and top-down information from a higher level (see Figure 2). At each level, the output signals of spatially adjacent modules, each processing its own local image patch, are combined and fed as input to the next higher level, whose outputs are in turn combined with those of its neighbors and fed into yet another higher level, until the entire visual field has been accounted for. As a result, the receptive fields of units become progressively larger as one ascends the hierarchical network, in a manner similar to that observed in the occipitotemporal pathway (V1 → V2 → V4 → IT) of the primate visual system (Van Essen & Maunsell, 1983; Desimone & Ungerleider, 1989). At the same time, feedback pathways allow top-down influences to modulate the output of noisy lower-level modules. From a computational perspective, such an arrangement allows a hierarchy of abstract internal representations to be learned while simultaneously endowing the system with properties essential for dynamic recognition, such as the ability to form pattern completions during occlusions. From a neuroanatomical perspective, such an architecture falls well within the known complexity of cortico-cortical and intracortical connections found in the primate visual cortex (Rockland & Pandya, 1979; Gilbert & Wiesel, 1981; Lund, 1981; Jones, 1981; Douglas, Martin, & Whitteridge, 1989; Felleman & Van Essen, 1991).

At the lowest level, the input image is processed in parallel by a large number of identical network modules that tessellate the visual field. For a given module, the local image patch can be considered an n-dimensional vector I, which is linearly filtered by a set of k modifiable filters as represented by the rows of a k × n feedforward matrix W of the synaptic weights of units. The synaptic weights comprising W are initialized to small random values and adapted in response to input stimuli according to the synaptic learning rules derived below. The image representation I in our model roughly corresponds to the representation at the lateral geniculate nucleus (LGN), while the filters defined by the rows of weight matrix W perform an operation similar to that carried out by simple cells in layer IV of V1. The filtered image is defined by a response vector y given by

$y = WI$.   (3.1)
The response vector y is fed into k (possibly nonlinear) "state" units, which compute running estimates $\hat{r}$ of the current internal state and predict the state $\bar{r}$ for the next time instant using their synaptic weights V (see Figure 2). There also exist n backprojection or feedback units with an associated n × k matrix
[Figure 2 appears here, showing levels 0, 1, and 2 of the hierarchy with feedforward weights W, feedback weights U, prediction weights V, and the bottom-up and top-down residual signals; see the caption below.]
Figure 2: The architecture of the model. The feedforward pathways for the first two levels are shown in the top half of the figure, while the bottom half represents the feedback pathways. Each feedforward weight matrix W is approximately the transpose of its corresponding feedback matrix U. The feedback pathways carry top-down reconstructed signals $r^{td}$ that serve as predictions for the lower-level modules. The feedforward pathways carry residuals between the current state and the top-down predictions, and these residuals are filtered through the feedforward weight matrix W at the next level. The current state prediction $\bar{r}(t)$ is computed from the previous state estimate $\hat{r}(t-1)$ using the prediction weights V and the activation function f. The current state estimate $\hat{r}(t)$ at each level is continually updated by combining the top-down and bottom-up residuals with the current state prediction according to the extended Kalman filter update equations derived in the text. The figure illustrates the architecture for the simple case of four level 1 modules feeding into a level 2 module. However, this arrangement can be easily generalized in a recursive manner to the case of an arbitrary number of lower-level modules feeding into a higher-level module, whose outputs are in turn combined with those of its neighbors and fed into yet another higher-level module, and so on, until the entire visual field has been covered.
of "generative" weights U, which allow reconstruction of the input at the lower, less abstract level from the current internal state at the higher, more abstract level. Their output is given by

$I^{td} = Ur$.   (3.2)
In addition to bottom-up input I, each level in the hierarchical network (except the highest level) also receives a top-down signal $r^{td}$ from a higher-level module. The signal $r^{td}$ acts as a prediction of the lower-level module's current state vector r, as estimated by a higher-level module after taking into account, for example, information from a larger spatial neighborhood in conjunction with the past history of inputs. Although the above equations were defined for the first hierarchical level, where the bottom-up input consisted of the image I(t), the same analysis can be applied to all levels of the network, where the bottom-up input is simply
the vector obtained by combining the output vectors of several neighboring lower-level modules, as illustrated in Figure 2 for the second level.

Given the above architectural setting, we are now left with the problem of prescribing the dynamics for the internal state vectors $\hat{r}$ and $\bar{r}$, as well as appropriate learning rules for the synaptic weights U, V, and W. In the previous section, we achieved our objective by simply minimizing the reconstruction error. As we shall see, a more general statistical approach is to model the inputs to the system as being stochastically generated by a hierarchical image generation process that can be characterized as follows. First, we assume that images are being generated by a linear imaging model of the form

$I(t) = U(t)\,r(t) + n_{bu}(t)$,   (3.3)
where U is a generative matrix (corresponding to the generative feedback weights in the network) and r is a hidden state vector. We assume the mean of the bottom-up noise process is $E[n_{bu}(t)] = 0$. The corresponding noise covariance matrix $\Sigma_{bu}$ at time t is given by

$E[n_{bu}(t)\,n_{bu}(s)^T] = \Sigma_{bu}(t)\,\delta(t, s)$,   (3.4)
where δ is the Kronecker delta function, equaling 1 if t = s and 0 otherwise. We model the difference between the top-down prediction $r^{td}$ from a higher level and the actual state vector r at the current level by another stochastic noise process $n_{td}$:

$r(t) = r^{td}(t) + n_{td}(t)$,   (3.5)
with $E[n_{td}(t)] = 0$. The top-down noise covariance matrix $\Sigma_{td}$ at time t is given by

$E[n_{td}(t)\,n_{td}(s)^T] = \Sigma_{td}(t)\,\delta(t, s)$.   (3.6)
Given the current state r(t), the transition to the state r(t + 1) at the next time instant is modeled as

$r(t+1) = f(V(t)\,r(t)) + n(t)$,   (3.7)
where f is a (possibly nonlinear) vector-valued activation function, V is the state transition matrix (corresponding to the "prediction" weights in the network), and n is a stochastic noise process with mean and covariance

$E[n(t)] = \bar{n}(t)$,   (3.8)

$E[(n(t) - \bar{n}(t))(n(s) - \bar{n}(s))^T] = \Sigma(t)\,\delta(t, s)$.   (3.9)
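For intuition, the generation process of equations 3.3 through 3.9 can be run forward to produce synthetic data. The sketch below assumes zero-mean gaussian noise, a tanh activation, and illustrative sizes; none of these specific choices is taken from the article.

```python
import numpy as np

# Sketch of the hierarchical image generation process (equations 3.3-3.9),
# run forward to produce synthetic data. All sizes and noise scales are
# illustrative assumptions.
n, k = 64, 10
rng = np.random.default_rng(1)
U = 0.1 * rng.standard_normal((n, k))   # generative weights (equation 3.3)
V = 0.9 * np.eye(k)                     # state transition weights (equation 3.7)
f = np.tanh                             # activation function

r = rng.standard_normal(k)
for t in range(100):
    I = U @ r + 0.05 * rng.standard_normal(n)     # equation 3.3: image + n_bu
    r = f(V @ r) + 0.05 * rng.standard_normal(k)  # equation 3.7: next state + n
```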
For the derivations below, we assume that the three stochastic noise processes introduced above are uncorrelated over time with each other and with the initial state vector. It is convenient to view the n × k generative weight matrix U and the k × k prediction weight matrix V as an nk × 1 vector u and a $k^2 \times 1$ vector v, respectively, where the vectors are obtained by collapsing the rows of the matrix into a single vector; for example,

$u = \begin{pmatrix} U_1^T \\ U_2^T \\ \vdots \\ U_n^T \end{pmatrix}$,   (3.10)

where $U_i$ denotes the ith row of U. We characterize the evolution of the different unknown weights in the hierarchical imaging model above by the following dynamic equations:

$u(t+1) = u(t) + n_u(t)$,   (3.11)
$v(t+1) = v(t) + n_v(t)$,   (3.12)
where $n_u$ and $n_v$ are stochastic noise processes with means and covariances given by

$E[n_u(t)] = \bar{n}_u(t)$,   (3.13)

$E[n_v(t)] = \bar{n}_v(t)$,   (3.14)

$E[(n_u(t) - \bar{n}_u(t))(n_u(s) - \bar{n}_u(s))^T] = \Sigma_u(t)\,\delta(t, s)$,   (3.15)

$E[(n_v(t) - \bar{n}_v(t))(n_v(s) - \bar{n}_v(s))^T] = \Sigma_v(t)\,\delta(t, s)$.   (3.16)
In summary, equations 3.3 through 3.16 reflect our assumed hierarchical model of how visual inputs are being stochastically generated. The goal of a given cortical module is thus to estimate, as closely as possible, its counterpart in the input generation model as characterized by the system of equations above.¹ It does so by minimizing a cost function based on the MDL principle.

Footnote 1: By convention, each estimate that the cortical module maintains will be differentiated from its counterpart in the input generation model by the occurrence of the hat symbol. For example, the mean of the current cortical estimate of the actual state vector r will be denoted by $\hat{r}$.

4 The MDL-Based Optimization Function

In order to prescribe the dynamics of the network and derive the learning rules for modifying the synaptic weight matrices, we make use of the
MDL principle (Rissanen, 1989; Zemel, 1994), a formal information-theoretic version of the well-known Occam's Razor principle commonly attributed to William of Ockham (thirteenth–fourteenth century A.D.) (Li & Vitanyi, 1993). Briefly, the goal is to encode input data in a way that balances the cost of encoding the data given the use of a model with the cost of specifying the model itself. The underlying motivation behind such an approach is to fit the model accurately to the input data while at the same time avoiding overfitting and allowing generalization.

Suppose that we have already computed a prediction $\bar{r}$ of the current state r based on prior data. In particular, let $\bar{r}$ be the mean of the current state vector before measurement of the input data at the current time instant. The corresponding covariance matrix is given by

$E[(r - \bar{r})(r - \bar{r})^T] = M$.   (4.1)
Similarly, let $\bar{u}$ and $\bar{v}$ be estimates of the current weights u and v calculated from prior data, with

$E[(u - \bar{u})(u - \bar{u})^T] = S$,   (4.2)

$E[(v - \bar{v})(v - \bar{v})^T] = T$.   (4.3)
Given a description language L, data D, and model parameters M, the MDL principle advocates minimizing the following cost function (Zemel, 1994):

$J(M, D) = |L(M, D)| = |L(D|M)| + |L(M)|$.   (4.4)
Here $|\cdot|$ denotes the length of the description. In our case, D consists of the current input image I(t) and top-down input $r^{td}(t)$, and M consists of the parameters U, V, and r that can be modulated.² Thus,

$|L(D|M)| = |L(I - Ur)| + |L(r - r^{td})|$.   (4.5)

Footnote 2: The feedforward weight matrix W does not appear among the model terms, since this matrix can be constrained to be the transpose of the feedback weight matrix U as a result of the synaptic learning rules described in section 6.
The two terms on the right-hand side of equation 4.5 are simply the bottom-up and top-down reconstruction errors. The cost of the model is given by

$|L(M)| = |L(r)| + |L(u)| + |L(v)|$.   (4.6)
In a Bayesian framework, equation 4.6 can be regarded as taking into account the prior model parameter distributions, while equation 4.5 represents the
negative log likelihood of the data given the model parameters. Minimizing the complete cost function J is thus equivalent to maximizing the posterior probability of the model M given data D. Given the true probability distribution (over discrete events) of the various terms in the above equations, the expected length of the optimal code for each term is given by Shannon's optimal coding theorem (Shannon, 1948):

$|L(x)| = -\log P(X = x)$,   (4.7)
where P(X = x) denotes the probability of the discrete event x. Since the true distributions are unknown, we appeal to the central limit theorem (Feller, 1968) and use multivariate gaussians as the prior distributions for coding the various terms above. However, encoding from a continuous distribution requires the calculation of the probability mass of a particular small interval of values around a given value for use in equation 4.7 (Nowlan & Hinton, 1992; Zemel, 1994). Using a trapezoidal approximation, we may estimate the mass under a continuous (in our case, gaussian) density p in an interval of width w around a value x to be $P(X = x) \cong p(x)w$. For encoding the errors (see equation 4.5), we assume w to be a constant infinitesimal width, which yields (using equation 4.7 and ignoring the constant terms due to the coefficients of the multivariate gaussians)

$|L(D|M)| = (I - Ur)^T \Sigma_{bu}^{-1} (I - Ur) + (r - r^{td})^T \Sigma_{td}^{-1} (r - r^{td})$.   (4.8)
For encoding the model terms (see equation 4.6), a constant infinitesimal width w may be inappropriate, since some values of the parameters may need to be encoded more accurately than others. For example, the network may be more sensitive to small changes in some parameter values than in others (Nowlan & Hinton, 1992). One solution, suggested by Nowlan and Hinton (1992), is to make w inversely proportional to the curvature of the cost function surface, but this may be computationally very expensive. A simpler alternative, which favors small values for the parameters given the choice between small and large values, is to let w be a zero-mean radially symmetric gaussian for each of the model terms u, v, and r, with variances inversely proportional to α, β, and γ, respectively. Using equation 4.7, the model cost then reduces to (ignoring the constant terms)

$|L(M)| = (r - \bar{r})^T M^{-1} (r - \bar{r}) + (u - \bar{u})^T S^{-1} (u - \bar{u}) + (v - \bar{v})^T T^{-1} (v - \bar{v}) + \gamma r^T r + \alpha u^T u + \beta v^T v$.   (4.9)
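To make the cost function concrete, the sketch below evaluates $J = |L(D|M)| + |L(M)|$ from equations 4.4, 4.8, and 4.9, assuming all quantities are supplied as NumPy arrays; the function name and argument layout are illustrative, not from the article.

```python
import numpy as np

# Sketch of the combined MDL cost (equations 4.4, 4.8, 4.9).
def mdl_cost(I, r, r_td, U, u, v, r_bar, u_bar, v_bar,
             Sigma_bu_inv, Sigma_td_inv, M_inv, S_inv, T_inv,
             alpha, beta, gamma):
    e_bu = I - U @ r                  # bottom-up reconstruction error
    e_td = r - r_td                   # top-down prediction error
    data_cost = e_bu @ Sigma_bu_inv @ e_bu + e_td @ Sigma_td_inv @ e_td  # eq 4.8
    model_cost = ((r - r_bar) @ M_inv @ (r - r_bar)
                  + (u - u_bar) @ S_inv @ (u - u_bar)
                  + (v - v_bar) @ T_inv @ (v - v_bar)
                  + gamma * r @ r + alpha * u @ u + beta * v @ v)        # eq 4.9
    return data_cost + model_cost                                        # eq 4.4
```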
The first three terms in equation 4.9 arise from the prior gaussian densities for r, u, and v, as given by $G(\bar{r}, M)$, $G(\bar{u}, S)$, and $G(\bar{v}, T)$, while the latter three terms are derived from the zero-mean gaussians associated with w. Note that equation 4.9 implements an intuitively appealing trade-off between achieving a good fit with respect to prior predictions ($\bar{r}$, $\bar{u}$, and $\bar{v}$) and
simultaneously penalizing large values for the parameters (activities r and weights u and v). One may also view the latter three terms as regularizers (Poggio, Torre, & Koch, 1985) that help prevent overfitting of data, thereby increasing the potential for generalization.³

Footnote 3: The variance-related regularization parameters α, β, and γ may be chosen according to a variety of techniques (see Girosi, Jones, & Poggio, 1995, for references), but for the experiments in this article, we used appropriately small but arbitrary fixed values for these parameters.

5 Network Dynamics and Optimal Estimation of Recognition State

The derivation of the estimation algorithm for recognition state is analogous to that for the discrete Kalman filter as given in standard texts such as Bryson and Ho (1975) (the continuous case follows in a straightforward fashion by applying a few limits to the difference equations in the discrete case; Bryson & Ho, 1975). Given new input I and top-down prediction $r^{td}$, the optimal estimate of the current state r after measurement of the new data is a stochastic code whose mean $\hat{r}$ and covariance P can be obtained by setting

$\frac{\partial J(M, D)}{\partial r} = 0$,   (5.1)
which implies

$M^{-1}(\hat{r} - \bar{r}) - U^T \Sigma_{bu}^{-1}(I - U\hat{r}) - \Sigma_{td}^{-1}(r^{td} - \hat{r}) + \gamma\hat{r} = 0$.   (5.2)
After some algebraic manipulation, equation 5.2 yields the following update rule for the mean of the optimal stochastic code at time t:

$\hat{r}(t) = \bar{r}(t) + P\,U^T \Sigma_{bu}^{-1} (I - U\bar{r}(t)) + P\,\Sigma_{td}^{-1} (r^{td} - \bar{r}(t)) - \gamma P\,\bar{r}(t)$,   (5.3)
where the prediction $\bar{r}$ is given by

$\bar{r}(t+1) = f(V(t)\,\hat{r}(t)) + \bar{n}(t)$.   (5.4)
It follows that the covariance matrix P at time t is given by the update rule

$P(t) = (M^{-1}(t) + U^T \Sigma_{bu}^{-1}(t)\,U + \Sigma_{td}^{-1}(t) + \gamma I)^{-1}$,   (5.5)
where I is the k × k identity matrix. The prediction error covariance matrix M is updated according to

$M(t+1) = \left(\frac{\partial f}{\partial r}\right) P(t) \left(\frac{\partial f}{\partial r}\right)^T + \Sigma(t)$,   (5.6)
with the partial derivative being evaluated at $r = \hat{r}(t)$. The partial derivatives arise from a first-order Taylor series approximation to the activation function f around $\hat{r}(t)$. The above set of equations can be regarded as implementing a form of the extended Kalman filter (Maybeck, 1979).

During implementation of the algorithm, the covariance matrices $\Sigma$, $\Sigma_{bu}$, and $\Sigma_{td}$ can be approximated by matrices $\hat{\Sigma}$, $\hat{\Sigma}_{bu}$, and $\hat{\Sigma}_{td}$ estimated from sample data (see Rao (1996) for further details). Further, by appealing to a version of the EM algorithm (see section 9.5), we may use $U = \bar{U}$ and $V = \bar{V}$ in the equations above, where $\bar{U}$ and $\bar{V}$ are learned estimates whose update rules are described in section 6 and the appendix, respectively. For neural implementation, the update equations above can be simplified considerably by noting that the basis filters that form the rows of the feedforward matrix W ($\cong \bar{U}^T$; see the learning rule below) also approximately decorrelate their input, thereby effectively diagonalizing the noise covariance matrices (see also Pentland, 1992, for related ideas). This allows the recursive update rules to be implemented locally in an efficient manner.

Equation 5.3 forms the heart of the dynamic state estimation algorithm. Note that equation 2.3, which was derived via gradient descent in section 2, is actually a special case of equation 5.3. More importantly, equation 5.3 implements an efficient trade-off between information from three different sources: the system prediction $\bar{r}(t)$, the top-down prediction $r^{td}$, and the bottom-up data I. Intuitively, the bottom-up and top-down gain matrices $P\,U^T \hat{\Sigma}_{bu}^{-1}$ and $P\,\hat{\Sigma}_{td}^{-1}$ can be interpreted as signal-to-noise ratios. Thus, when the bottom-up noise variance $\hat{\Sigma}_{bu}$ is high (for instance, due to occlusions), the bottom-up term $(I - U\bar{r}(t))$ is given less weight in the state estimation step (due to a lower gain matrix), and the estimate $\hat{r}(t)$ relies more on the top-down term $(r^{td} - \bar{r}(t))$ and the system prediction $\bar{r}(t)$. On the other hand, when the top-down noise variance $\hat{\Sigma}_{td}$ is high (for instance, due to ambiguity in interpretation by the higher-level modules), the estimate $\hat{r}(t)$ relies more heavily on the bottom-up term $(I - U\bar{r}(t))$ and the system prediction $\bar{r}(t)$. Finally, if the system prediction $\bar{r}(t)$ has a large noise variance, the matrix P assumes larger values, which implies that the system relies more heavily on the top-down and bottom-up input data rather than on its noisy prediction $\bar{r}(t)$. The dynamics of the network thus strives to achieve a delicate balance between the current prediction and the inputs from various cortical sources by exploiting the signal-to-noise characteristics of the corresponding input channels. Note that by keeping track of the degree of correlations between units at any given level, the covariance matrices also dictate the degree of lateral interactions between the units, as determined by equation 5.3.

In summary, the optimal estimate of the current state before measurement of current data is a stochastic code with mean $\bar{r}$ and covariance M. After measurement and update as given by equations 5.3 and 5.5, the mean and covariance become $\hat{r}$ and P, respectively. The mean $\hat{r}$ is updated by filtering the current residuals or "innovations" (Maybeck, 1979) $(I - U\bar{r})$ and
$(r^{td} - \bar{r})$ using $P\,U^T \hat{\Sigma}_{bu}^{-1}$ and $P\,\hat{\Sigma}_{td}^{-1}$, respectively. Note that this requires the existence of the matrix $U^T$ for filtering the bottom-up residual $(I - U\bar{r})$ that is being communicated from the lower level. This computation can be readily implemented if the feedforward weights W happen to be symmetric with the feedback weights U, that is, $W = U^T$. Therefore, we need to address the question of how such feedforward and feedback weights can be developed.
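The update equations 5.3 through 5.6 can be collected into a single estimation step. The sketch below assumes a linear activation function f(x) = Vx (so the Jacobian in equation 5.6 is simply V) and treats all covariances as given; it is a schematic rendering of the filter, not the authors' implementation.

```python
import numpy as np

# Sketch of one step of the state estimator (equations 5.3-5.6) for a single
# module, under the simplifying assumption of a linear activation function.
def kalman_step(I, r_td, r_bar, M, U, V,
                Sigma_bu_inv, Sigma_td_inv, Sigma, gamma):
    k = r_bar.size
    # Equation 5.5: posterior covariance after measuring I and r_td.
    P = np.linalg.inv(np.linalg.inv(M) + U.T @ Sigma_bu_inv @ U
                      + Sigma_td_inv + gamma * np.eye(k))
    # Equation 5.3: combine the prediction with bottom-up and top-down residuals.
    r_hat = (r_bar + P @ U.T @ Sigma_bu_inv @ (I - U @ r_bar)
             + P @ Sigma_td_inv @ (r_td - r_bar) - gamma * P @ r_bar)
    # Equation 5.4 (linear f): predict the next state.
    r_bar_next = V @ r_hat
    # Equation 5.6 (linear f, so the Jacobian df/dr is just V).
    M_next = V @ P @ V.T + Sigma
    return r_hat, P, r_bar_next, M_next
```

Iterating `kalman_step` over a stream of inputs alternates measurement updates (equations 5.3 and 5.5) with prediction updates (equations 5.4 and 5.6), the standard predict-correct cycle of a Kalman filter.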
6 Learning the Feedforward and Feedback Synaptic Weights

The previous section described one method of minimizing the optimization function J(M, D): updating the estimate of the current state r, as given by $\hat{r}$ and P, in response to the input data. However, J can be further minimized by additionally adapting the weights u and v, and the feedforward weights w as well. Here, we derive the learning rules for u and w. A possible learning procedure for the prediction weights v is described in the appendix. Note that in order to achieve stability in the estimation of the state r, the weights need to be adapted at a much slower rate than the dynamic process that is estimating r.

We first derive the weight update rule for estimating the backprojection (generative) weights u. Notice that

$(I - Ur) = (I - Ru)$,   (6.1)
where R is the n × nk matrix given by

$R = \begin{pmatrix} r^T & 0 & \cdots & 0 \\ 0 & r^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & r^T \end{pmatrix}$.   (6.2)
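In NumPy, R can be constructed with a Kronecker product, and the identity of equation 6.1 checked directly; the sizes below are arbitrary.

```python
import numpy as np

# The block-diagonal matrix R of equation 6.2, built as kron(I_n, r^T).
n, k = 4, 3                       # illustrative sizes
r = np.arange(1.0, k + 1)         # example state vector
R = np.kron(np.eye(n), r)         # shape (n, n*k): r^T blocks on the diagonal

U = np.random.default_rng(2).standard_normal((n, k))
u = U.ravel()                     # rows of U stacked, as in equation 3.10
assert np.allclose(U @ r, R @ u)  # the identity Ur = Ru behind equation 6.1
```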
Differentiating J with respect to the backprojection weights u and setting the result to 0,

$\frac{\partial J(M, D)}{\partial u} = 0$,   (6.3)
we obtain, after some algebraic manipulation, the following update rule for the mean of the optimal stochastic weight vector at time t (note that we use $r = \bar{r}(t)$ as prescribed by a version of the EM algorithm; see section 9.5):

$\hat{u}(t) = \bar{u}(t) + P_u\,\bar{R}(t)^T \Sigma_{bu}^{-1} (I - \bar{R}(t)\bar{u}(t)) - \alpha P_u\,\bar{u}(t)$,   (6.4)
where $\bar{u}(t+1) = \hat{u}(t) + \bar{n}_u(t)$ and $\bar{R}(t)$ is the matrix obtained by replacing r with $\bar{r}(t)$ in the definition of R above. The covariance matrices are updated
according to

$P_u(t) = (S^{-1}(t) + \bar{R}(t)^T \Sigma_{bu}^{-1}(t)\,\bar{R}(t) + \alpha I)^{-1}$,   (6.5)

$S(t+1) = P_u(t) + \Sigma_u(t)$,   (6.6)
where I is the nk × nk identity matrix. Note that equation 6.4 is simply a Hebbian learning rule with decay, the presynaptic activity being the current responses $\bar{R}(t)$ and the postsynaptic activity being the residual $(I - \bar{R}(t)\bar{u}(t))$ (see the feedback pathway in Figure 2). It is relatively easy to see that equation 2.5, which we derived via gradient descent in section 2, is in fact a special case of equation 6.4 above. Thus, as we mentioned in section 2, for the case of feedforward processing, where $r = WI$ and $U = W^T$, this learning rule (without the covariance matrices) becomes identical to Williams's (1985) symmetric error-correction learning rule and Oja's (1989) subspace network learning algorithm, both of which perform an operation equivalent to PCA (Chatfield & Collins, 1980). The more general form above, which additionally incorporates dynamics for the state vector, allows the development of nonorthogonal and local representations as well.

Another interesting feature of the above learning rule is the transition from $\hat{u}(t)$ to $\bar{u}(t+1)$. This step allows the modeling of intrinsic neuronal noise via the term $\bar{n}_u(t)$. By assuming a distribution for this noise term (for example, a zero-mean gaussian distribution) and sampling from this distribution, the transition from $\hat{u}(t)$ to $\bar{u}(t+1)$ can be made stochastic. This helps the learning algorithm avoid local minima and facilitates the search for a global minimum. The weight decay term $-\alpha P_u \bar{u}(t)$ (due to the model term |L(u)| in the MDL optimization function) penalizes overfitting of the data and helps increase the potential for generalization.

Having specified a learning rule for the feedback weights, we need to formulate one for the feedforward weights such that the dynamics expressed by equation 5.3 is satisfied. In other words, the feedforward weight matrix W should be updated so as to approximate the transpose of the feedback matrix U. One solution is to assume that the feedforward weights are initialized to the same random values as the transpose of the feedback weights and then to apply the rule given in equation 6.4. However, it is highly unlikely that biological neurons can have their synapses initialized in such a convenient manner. A more reasonable assumption is that the feedforward weights are initialized independently of the feedback weights. Let w represent the vectorized form of the matrix $W^T$, analogous to equation 3.10. Consider the following weight modification rule for the feedforward weights at time t:

$\hat{w}(t) = \bar{w}(t) + P_u\,\bar{R}(t)^T \Sigma_{bu}^{-1} (I - \bar{R}(t)\bar{u}(t)) - \alpha P_u\,\bar{w}(t)$,   (6.7)
where we use $\bar{w}(t+1) = \hat{w}(t) + \bar{n}_w(t)$, with $\bar{n}_w(t)$ a noise term analogous to $\bar{n}_u(t)$. This is again a form of Hebbian learning (with decay), the presynaptic activity this time being
the residual image and the postsynaptic activity being the current $\bar{r}$ (see the feedforward pathway in Figure 2). It is relatively easy to show that the above rule causes the feedforward matrix to approach the transpose of the feedback matrix, that is, $W \cong U^T$, even when they start from different initial conditions. Define a difference vector d as

$d = \hat{u}(t) - \hat{w}(t)$.   (6.8)
Then, from equations 6.4 and 6.7,

$d = (\bar{u}(t) - \bar{w}(t)) - \alpha P_u (\bar{u}(t) - \bar{w}(t))$.   (6.9)
Asymptotically, as $\bar{u}(t) \to \hat{u}(t)$ and $\bar{w}(t) \to \hat{w}(t)$, $(\bar{u}(t) - \bar{w}(t)) \to d$, and equation 6.9 reduces to

$-\alpha P_u\,d \cong 0$.   (6.10)
Thus, given that α > 0 and $P_u$ is positive definite, the expected value of d approaches zero asymptotically, which implies $u \cong w$. In other words (since w is $W^T$ by definition), $W \cong U^T$, which satisfies the condition needed for the dynamics of the network (see equation 5.3) to converge to the optimal response vector for a given input.

7 Experimental Results

Before presenting simulation results for the free viewing and fixation experiments, we verify the viability of the model by describing results from two related experiments.

7.1 Experiment 1: Learning and Recognizing Simple Objects. The first experiment was designed to test the model's performance in a simple recognition task. A hierarchical network as shown in Figure 2 was used with four level 1 modules, each processing its local image patch, feeding into a single level 2 module. The four level 1 feedforward modules each contained 10 units (rows of W), and the second-level module contained 25. The constants α and γ that determine the variances of the MDL model terms were set at 0.02 and 0.1, respectively. The feedforward and feedback matrices were trained by exposing the network to gray-scale images of five objects as shown in Figure 3A. Each input image was of size 128 × 128 and was split into four equal subimages of size 64 × 64 that were fed to the four level 1 modules. The average gray-level values were subtracted from each image and the resulting image normalized to unit length before being input to the network. For each input training image, the entire network was allowed to stabilize before the weights were updated. Since only static images were
used, we used $\bar{r}(t+1) = \hat{r}(t)$. Note that even without a nonlinear prediction step, for an arbitrary set of possibly nonorthogonal weight vectors, the optimal response of the network at a given hierarchical level remains a highly nonlinear function of the input image that cannot be computed by a single feedforward pass but rather must be recursively estimated by using equation 5.3 until stabilization is achieved (cf. Daugman, 1988). Figures 3B through E illustrate the response of the trained network to various input images.

7.2 Experiment 2: Learning Receptive Fields from Natural Images. The previous section demonstrated the model's ability to learn internal representations for various man-made objects and subsequently use these representations for robust recognition. In this section, we ask: What internal representations does the network develop if it is exposed to a sequence of arbitrary natural images? This question becomes especially relevant if the network is to be used as a model of visual cortical processing during, for example, free viewing of natural images; the internal representations (such as the weights W) developed by the model will then need to be at least qualitatively similar to corresponding representations in the cortex that have been reported in the literature.

We once again employed a hierarchical network as shown in Figure 2, with four level 1 modules feeding into a single level 2 module. The four level 1 feedforward modules each contained 20 units (rows of W), and the level 2 module contained 25 rows. Eight gray-scale images of natural outdoor scenes were used for training the network. However, unlike the previous experiment, the receptive fields of the four level 1 modules were allowed to overlap, as shown at the right of Figure 4, in order to model qualitatively the overlapping organization of receptive fields of neighboring cells in the cortex (Hubel, 1988). Thus, for each natural image, a 20 × 20 image patch was chosen, and four overlapping 16 × 16 subimages, offset by four pixels horizontally and/or vertically from each other, were fed as input to the lower-level modules. Each of the subimages was preprocessed by subtracting the average gray-level value from each pixel to approximate the removal of DC bias at the level of the LGN. The resulting subimage was windowed by a 16 × 16 gaussian in order to simulate a gaussian input dendritic field. A total of 12,114 image patches were used for training the hierarchical network. For each image patch, the network was allowed to converge to the different optimal response vectors $\hat{r}$ for the different modules at each level
before the corresponding weights ($W = \bar{U}^T$) were updated. The variance-dependent constants α and γ were set to 0.02 and 0.1, respectively, and the temporal prediction step was simply $\bar{r}(t+1) = \hat{r}(t)$, as in the previous section. Figure 4 shows the feedforward synaptic weights W for the level 1 (equivalent to V1) and level 2 (V2) units that were learned using equation 6.7.
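The patch-extraction and windowing procedure just described can be sketched as follows. The gaussian width is an assumption (the article specifies only a 16 × 16 gaussian window), as are the function names.

```python
import numpy as np

# Sketch of the preprocessing for experiment 2: four overlapping 16x16
# subimages from a 20x20 patch, mean-subtracted and gaussian-windowed.
def gaussian_window(size=16, sigma=4.0):   # sigma is an assumed value
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2 * sigma ** 2))
    return np.outer(g, g)

def preprocess(patch20):
    """Split a 20x20 patch into four overlapping 16x16 inputs (offset 4 px)."""
    window = gaussian_window()
    inputs = []
    for dy in (0, 4):
        for dx in (0, 4):
            sub = patch20[dy:dy + 16, dx:dx + 16].astype(float)
            sub = sub - sub.mean()         # approximate DC removal at the LGN
            inputs.append((sub * window).ravel())
    return inputs                           # four vectors for the level 1 modules
```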
[Figure 3 appears here. For each of panels A through E, the columns show the input to the network, the top-down prediction, and the residual.]
Figure 3: Results from experiment 1: Recognition of simple objects. (A) The five objects used for training the hierarchical network. The first image additionally shows how a given image was partitioned into four local subimages that were fed to the four corresponding level 1 modules. (B) through (E) illustrate the response of the trained network to various input images. (B) When a training image is input, the network predicts an almost perfect reconstructed image, resulting in a residual that is almost everywhere zero, which indicates correct recognition. (C) If a partially occluded object from the training set is input, the unoccluded portions of the image together contribute, via the level 2 module, to predict and fill in the missing portions of the input. (D) When the network is presented with an object that is highly similar to a trained object, the prediction is that of the closest resembling object (the car in the training set). However, the large residual allows the network to judge the input as a new object rather than classifying it as the training object and incurring a false-positive error, a common occurrence in most purely feedforward systems. (E) shows that a completely novel object results in a prediction image that is an arbitrary mixture of the training images and as such, generates large residuals. The system can then choose to learn the new object using equations 6.4 and 6.7, or ignore the object if it happens to be behaviorally irrelevant.
Figure 4: Results from experiment 2: Learning receptive fields from natural images. (A) Three of the eight images used in training the weights. The boxes on the right show the size of the first- and second-level receptive fields, respectively, on the same scale as the images. Four overlapping gaussian windowed image patches extracted from the natural image were fed to the level 1 modules, shown at the extreme left in Figure 2. The responses from the four level 1 modules were in turn combined in a single vector and fed to the level 2 module. The effective level 2 receptive field thus encompasses the image region spanned by the four overlapping level 1 receptive fields as shown on the right in (C). (B) Ten of the 20 receptive fields (or spatial filters) as given by the rows of W in a level 1 module. These resemble classical oriented edge-bar detectors, which have been previously modeled as difference of offset gaussians or Gabor functions. (C) Six of 20 receptive fields as given by the rows of W at level 2.
The level 1 receptive fields resemble nonorthogonal wavelet-like edge-bar detectors at different orientations, similar to the receptive fields of simple cells in V1 (Hubel & Wiesel, 1962, 1968; Palmer, Jones, & Stepnoski, 1991; see also Olshausen & Field, 1996; Harpur & Prager, 1996; Bell & Sejnowski, in press). These have previously been modeled as two-dimensional Gabor functions (Daugman, 1980; Marcelja, 1980) or difference-of-gaussian operators (Young, 1985). The level 2 receptive fields were constructed by propagating a unit impulse response for each row of the weight matrix W at the second level down to level 0. These receptive fields appear to be tuned toward more complex shapes such as corners, curves, and localized edges. The existence of cells coding for features more complex than simple edges and bars has also been reported in several electrophysiological studies of V2 (Hubel & Wiesel, 1965;
Baizer et al., 1977; Zeki, 1978). Thus, both the first- and second-level units in the model develop internal representations that are comparable to those of their counterparts in the primate visual cortex.

8 Simulations of Free Viewing and Fixation Experiments

The experimental results from the previous section suggest that the model may provide a useful platform for studying cortical function, by demonstrating the model's efficacy in a simple visual recognition task and showing that the learning rules employed by the model are capable of developing receptive fields that resemble cortical cell receptive field weighting profiles. In this section, we return to the question posed in the introduction: Why do the responses of some visual cortical neurons differ so drastically when the same image regions enter the neuron's receptive field in free viewing and fixation conditions (Gallant et al., 1994, 1995; Gallant, 1996)? A possible answer is suggested by the recognition results in Figure 3C. In particular, the figure shows that the responses of neurons depend not only on the image in the cell's receptive field but also on the images in the receptive fields of neighboring cells. This suggests that a cell's response may be modulated to a considerable extent by the surrounding context in which the cell's stimulus appears, due to the presence of top-down feedback in conjunction with lateral interactions mediated by the covariance matrices P, $\hat{\Sigma}_{bu}^{-1}$, and $\hat{\Sigma}_{td}^{-1}$ (see equation 5.3).

For comparison with Gallant et al.'s (1995; Gallant, 1996) electrophysiological recording results, we assumed that the cell responses recorded correspond to the residual $(r - r^{td})$ in the model (see Figure 2).⁴ Qualitatively similar results are obtained by measuring the filtered bottom-up residual.

Footnote 4: The primate visual cortex is a laminar structure that can be divided into six layers (Felleman & Van Essen, 1991). The supragranular pyramidal cells (e.g., layer III) typically project to layer IV of the next higher visual area, while infragranular pyramidals (e.g., layer V/VI) send axons back to layers I, V, and/or VI of the preceding lower area (Van Essen, 1985; Felleman & Van Essen, 1991). The architecture of the model (see Figure 2) suggests that the supragranular cells may carry the residual $(r - r^{td})$ to the next level, while the infragranular cell axons could convey the current top-down prediction Ur (= $I^{td}$ or $r^{td}$) to the lower level. The bottom-up residual $(I - Ur)$ would then be filtered by layer IV cells (whose synapses encode W) with the help of interneurons, which implement the lateral interactions due to the covariances. The estimate $\hat{r}(t)$ presumably would be computed by infragranular cells (whose synapses could encode V) during the course of predicting $\bar{r}(t+1)$; the presence of recurrent axon collaterals (Lund, 1981) among these cells is especially suggestive of such a recursive computation. These functional interpretations of cortical circuitry are, however, speculative and require rigorous experimental confirmation.

For the simulation experiments, we modeled cells in V1 and V2 as shown in Figure 2 (with four level 1 modules feeding into a single level 2 module), but
the model can be readily extended to more abstract higher visual areas with a higher degree of convergence at each level. The parameters were set to the same values as in section 7.2. The feedforward and feedback weights were set equal to the values learned during prior exposure to natural images as described in section 7.2. The free viewing experiments were simulated by making random eye movements to image patches within a given natural image and allowing the entire network to converge until all cell responses stabilized; the response values (as given by the residual) were then recorded. For determining the response of a cell during the fixation task, only the image region that fell in the cell's receptive field during free viewing was displayed and the cell's response noted. The cell responses recorded were numerical quantities that could be used to generate spikes in a more detailed neuron model. Since the eye movements were assumed to be random, the temporal prediction step involving V was not used in these simulations, but we nevertheless expect it to play an important role in future simulations that employ temporal as well as spatial context in predicting stimuli.

Figure 5 shows the simulation results. As noted above, the response of a given cell at any level is determined not just by the image falling within its receptive field but also by the images in the receptive fields of neighboring cells, since the neighboring modules feed into the higher level, which conveys the appropriate top-down prediction back to the lower level: the better the top-down prediction $r^{td}$, the smaller the residual $(r - r^{td})$ and, hence, the smaller the response. In the case of free viewing an image, the neighboring modules play an active role in contributing to the top-down prediction for any cell, and the existence of this larger spatial context allows for smaller responses in general. This is shown in Figure 5A, where the given cell's response initially increases until appropriate top-down signals cause the response to diminish to a stabilized level. On the other hand, in the fixation task, only the image subregion that fell in a cell's receptive field during free viewing is displayed on a blank screen. The absence of any contextual information results in a much less accurate top-down prediction due to the lack of contributions from neighboring modules, and, hence, a relatively larger response is elicited, as illustrated in Figure 5B.

The histograms of responses in the simulations of fixation and free viewing conditions are shown in Figure 5C. The results show that in most, but not all, cases, the responses in free viewing are substantially reduced. This is consistent with the free viewing data (Gallant, 1996) and the data from accompanying control experiments (see below). Perhaps the most interesting feature of the free viewing versus fixation histograms is that they overlap, suggesting that the reduction in responses is not a binary all-or-none phenomenon but rather one that is dependent on the specific ability of the higher-level units to predict the current state at the lower level.
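The free viewing versus fixation protocol can be summarized in schematic form. Here `network.settle` is a hypothetical interface standing in for the relaxation of the full hierarchical model to equilibrium; it is not an API from the article.

```python
import numpy as np

# Sketch of the section 8 comparison: the recorded "response" is the
# magnitude of the residual (r - r_td) after the network settles.
def response(network, image_patch, context_patches):
    """Settle the network and return the level 1 residual magnitude."""
    r, r_td = network.settle(image_patch, context_patches)  # assumed API
    return np.linalg.norm(r - r_td)

def compare(network, image_patch, neighbors):
    free_viewing = response(network, image_patch, neighbors)  # full context
    blank = [np.zeros_like(p) for p in neighbors]
    fixation = response(network, image_patch, blank)          # patch alone
    return free_viewing, fixation   # fixation is typically the larger one
```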
[Figure 5 appears here. Panels A and B plot the response of a simulated neuron (0 to 0.4) against time (0 to 120 iterations) under free viewing and fixation, respectively; panel C plots response frequency against response magnitude (0 to 0.45) for both conditions.]
Figure 5: Free viewing and fixation simulation results. (A) An example of the dynamic response of a cell to an image patch in the simulated case of free viewing. (B) The response of the same cell to the same image patch but under conditions of fixation. (C) The histograms of responses accumulated over 20 cells in a first-level module for 500 image presentations, showing the fixation responses (light gray) compared to the free viewing responses (black). Responses obtained during free viewing were found to be generally much diminished compared with those obtained during the fixation task. As the graph depicts, more than 80 percent of the free viewing responses lie within the first bin in the histogram; the fixation distribution, on the other hand, has a much longer tail than the free viewing distribution.
9 Discussion

In related experiments (Gallant et al., 1995), it was observed that responses were sometimes also suppressed by enlarging the stimulus beyond the cell's receptive field in the fixation task. More recently, Gallant has shown that in many cases, the response attenuation observed during free viewing can be duplicated in a static fixation task in which a review "movie" recreating the spatiotemporal sequence of image patches that entered the receptive field during free viewing is presented to the monkey, the review stimuli being three times the size of the classical receptive field (Gallant, 1996). The results from both of these experiments, which do not involve any eye movements, lend support to our explanation of the suppression of responses as occurring largely due to contributions from neighboring modules in response to the global input stimuli, rather than being the result of oculomotor predictive mechanisms (such as the "shifter circuit" of Anderson & Van Essen, 1987). However, this does not preclude the possibility of suppression due to oculomotor predictions occurring in other visual cortical areas, especially those in the parietal cortex, which are connected to oculomotor structures and where such predictive remapping effects have been observed (Duhamel, Colby, & Goldberg, 1992).

A number of neurophysiological effects due to stimuli from beyond the classical receptive field can be explained within the context of the present model as occurring due to the influence of top-down predictions. Consider, for example, the classical phenomenon of end stopping, or suppression of responses when the stimulus (e.g., a line) extends beyond the cell's receptive field (Hubel & Wiesel, 1965, 1968). This is clearly a highly "nonlinear" property of the "hypercomplex" cell (Hubel & Wiesel, 1965) if one considers the cell in isolation; however, the model suggests that such "nonlinearities" may in fact arise due to cortico-cortical, geniculo-cortical, and intracortical interactions between linear cells as well. An example of this phenomenon is Figure 3C, where a neural unit participates in the completion of an occluded image: The unit's response appears to be a highly nonlinear function of the input image patch even though its activation function is linear. In the case of end stopping, the present model predicts that the ends are "stopped" via inhibitory feedback whenever the surrounding and encompassing receptive fields can predict the cell's local input. The behavior of the cell appears nonlinear only if the cell is viewed in isolation without regard to possible interactions between its neighbors within a level and across levels (as suggested by the covariance terms and top-down feedback terms, respectively, in equation 5.3).⁵ Indeed, experiments by Murphy and Sillito (1987) confirm
744
Rajesh P. N. Rao and Dana H. Ballard
the contribution of cortical feedback in mediating end inhibition in cells in the dorsal LGN. In addition, Bolz and Gilbert (1986) have shown that supragranular cells in the cat striate cortex (V1) lose the property of end-stopping when infragranular cells (in particular, cells in layer VI) are inactivated by using the inhibitory transmitter gamma-aminobutyric acid (GABA). Since infragranular cells in V1 receive feedback from higher extrastriate visual cortical areas (Van Essen, 1985; Felleman & Van Essen, 1991), it is reasonable to assume that inactivation of these cells prevents top-down feedback from influencing the supragranular cells, thereby causing them to lose their property of end stopping. Note that other properties of the supragranular cells such as orientation selectivity were found to remain intact, which is in agreement with our model. Further evidence for the fundamental role of top-down feedback in influencing neural activity at lower cortical levels is provided by Mignard and Malpeli (1991), who show that cells in the upper layers of V1 can be driven in the absence of direct bottom-up input from the LGN. However, additionally destroying V2 profoundly reduced the upper layer neural activity, which led them to conclude that the cause of the upper layer activity was in fact top-down feedback from V2. Given the above observations, it may be possible to attribute other types of “nonspecific suppression” in V1 cells, such as cross-orientation inhibition (DeAngelis, Robson, Ohzawa, & Freeman, 1992) to bottom-up–top-down cortico-cortical and lateral interactions such as those occurring in equation 5.3 rather than to purely normalization-based suppression mechanisms (Heeger, Simoncelli, & Movshon, 1996). Also, in the light of the above discussion, the model predicts that much less response suppression should be observed in Gallant et al.’s free-viewing experiments if top-down feedback is precluded, for instance, by intracortical pharmacological means as in the experiments of Bolz and Gilbert (1986) or via destruction of higher cortical areas as in the experiments of Mignard and Malpeli (1991). Suppression of neural responses has also been observed in higher visual cortical areas such as MT (Allman et al., 1985), V4 (Desimone & Schein, 1987; Muller et al., 1996), and IT (Miller, Li, & Desimone, 1991). Many of these experiments employ time-varying stimuli such as drifting gratings, and we would therefore expect the responses to be modulated with respect to not just spatial context (as in this article) but also recent temporal context. This would necessitate learning the prediction matrix V (a possible method is suggested in the appendix) and using it in conjunction with an appropriate activation function f for predicting spatiotemporal inputs as given by equation 5.4. Simulation of such spatiotemporal cortical models constitutes an active direction of future research.
causing the appearance of a nonlinearity; (2) the model allows for neurons with nonlinear activation functions in the prediction step (equation 5.4) at each hierarchical level.
9.1 Free Viewing of Natural Scenes by Humans. An interesting question is whether the model conforms to behavioral data obtained from human subjects during free viewing of natural images. Experiments by McConkie and others (McConkie, 1991; Grimes & McConkie, 1995) appear to be especially relevant in answering this question. In these experiments, subjects were given the primary task of remembering an image, but they were also informed that changes might be introduced in the image as they examined it. They were asked to press a button when they detected a change. During a given trial, parts of the image were changed during certain saccades. Surprisingly, the subjects were remarkably unaware of such prominent changes as the addition or deletion of objects and changes in color.

The remarkable insensitivity of human subjects to changes in the image during "free viewing" might appear to contradict the present model since, after all, a change in the image from a previous fixation should introduce a large residual and hence direct the subject's attention to it.6 Recent experiments in our laboratory (Ballard, Hayhoe, Pook, & Rao, in press) and in McConkie's laboratory (Irwin, McConkie, Carlson-Radvansky, & Currie, 1994) in fact do corroborate this expectation: Subjects do indeed become aware of the changes if the object that was subject to a change was relevant to the task at hand. For example, Ballard et al. (in press) report an increase in fixation time of subjects whenever task-relevant changes are introduced during the course of a block copying task. In addition, neurophysiological experiments by Miller et al. (1991) indicate that responses in IT neurons in the monkey are attenuated according to whether the presented stimulus matches a previously memorized object, yielding large responses (residuals) during mismatches and response suppression (small residuals) during matches. Indeed, the theoretical framework of Crick and Koch (see, for example, Koch, 1996) proposes that visual awareness may be correlated with responses in higher visual cortical areas such as V4 and IT, but probably not lower-level areas such as V1. This supports our argument that high residuals in lower-level areas, especially when in image regions not related to the task at hand, need not always imply awareness of the corresponding changes as in McConkie's experiments.

9.2 Illusory Contours, Amodal Completions, Visual Imagery, and Bistable Percepts. We have shown that the ability of adjacent modules to contribute to the output of a neighbor (via feedback from the next level) endows the model with important properties, such as the ability to form completions in the presence of occlusions (for example, see Fig. 3). The same property also allows alternate interpretations of phenomena such as subjective-illusory contours and amodal completions (Kanizsa, 1990). For
6 This observation applies to models employing temporal context to predict postsaccadic stimuli, whereas the simulations described in this article used spatial context only.
example, Grosof, Shapley, and Hawken (1992) and von der Heydt, Peterhans, and Baumgartner (1984) report the existence of neurons responding to illusory contours in V1 and V2, respectively. The present model suggests that these responses may be the result of expectation-based top-down feedback, as illustrated by Figure 3C. In this regard, the existence of overlap between adjacent receptive fields helps considerably in providing more accurate completions as compared to nonoverlapping receptive fields.

The feedback pathways in the model also provide a convenient substrate for realizing visual imagery as illustrated by the top-down reconstructed images in Figure 3. Indeed, recent positron emission tomography (PET) studies have shown that imagery shares some of the same neural substrates as perception and can activate even the lowest visual areas (Kosslyn, Thompson, Kim, & Alpert, 1995).

Equation 5.4 allows the effects of intrinsic neuronal noise to be modeled via the additive noise term n(t). By assuming a probability distribution for this term and sampling from this distribution in equation 5.4, the computation of the recognition state estimates can be made stochastic. An alternate but equally plausible method to make the model stochastic is to sample directly the current recognition state from a gaussian distribution with mean r̄ and covariance M. Either of these settings could conceivably allow modeling of bistable percepts and Gestalt inversions such as Rubin's vase or Necker's cube. Such phenomena could be attributed to a lack of convergence at the highest level to a stable object hypothesis. The stochastic nature of the model in turn would allow periodic transitions between two alternate interpretations of the same visual image (local minima in recognition state space). Evaluating the efficacy of these ideas remains the subject of ongoing simulations.
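The following minimal sketch illustrates the second option; all dimensions and values are hypothetical stand-ins, and in the model r̄ and M would come from the estimation step rather than being set by hand:

```python
import numpy as np

# Minimal sketch of the stochastic variant described above: draw the
# recognition state from a gaussian with mean r_bar and covariance M.
# All dimensions and values here are hypothetical stand-ins.
rng = np.random.default_rng(0)

k = 10                                  # hypothetical state dimension
r_bar = rng.standard_normal(k)          # stand-in for the current mean estimate
M = np.eye(k) * 0.05                    # stand-in for its covariance

# Repeated draws could settle near different local minima in recognition
# state space, one possible account of bistable percepts.
r_sample = rng.multivariate_normal(r_bar, M)
```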
9.3 Comparison with Related Work on Cortical Modeling. There has been considerable work in recent years on determining the computational nature of the cortex and the brain (Churchland & Sejnowski, 1992). This work includes models ranging from feedforward networks, such as the hierarchical "perceptual network" of Linsker (1988) and the "HBF" networks of Poggio and collaborators (Poggio, 1990; Poggio, Fahle, & Edelman, 1992), to bidirectional "flow-control"-based models, such as Ullman's (1994) Counterstreams model, Deacon's (1989) "counter-current diffusion" model, and Van Essen, Anderson, and Olshausen's (1994) Dynamic Routing Circuit model. Although an exhaustive survey is beyond the scope of this article, we briefly outline comparisons of the present model with some closely related models that employ both feedforward and feedback connections for perception and learning.

Perhaps the earliest use of feedback mechanisms as integral components of perception can be found in the classic article by Pitts and McCulloch (1947) in which a negative feedback model of oculomotor control is
proposed. Another early model in which feedback plays a central role is that of MacKay (1956), who describes an automaton that generates "imitative" internal responses to match incoming signals adaptively. The present model can be regarded as a concrete stochastic solution to the "epistemological problem" that is the subject of MacKay's work. Feedback also plays an important role in several more recent theories, such as Carpenter and Grossberg's (1987) adaptive resonance theory, Edelman's (1978) "reentrant" signaling theory, Fukushima's (1988) extended Neocognitron model, Harth, Unnikrishnan, and Pandya's (1987) "Alopex" optimization model, Mumford's (1994) model of the cortex based on Grenander's (1976–1981) Pattern Theory, and Rolls' (1989) theories on the hippocampus and cortical backprojections, all of which use feedback connections to instantiate forward models (Jordan & Rumelhart, 1992) of the input data but differ in the way top-down feedback is compared and integrated with bottom-up input.

Of these, Mumford's model is closest in spirit to the ideas presented in this article. However, iterative relaxation schemes such as those proposed by Mumford and others have been previously dismissed as models of perception due to the large number of iterations that may be required to ensure convergence, a fact that contradicts the rapidity of recognition of static stimuli in humans and primates (Thorpe & Imbert, 1989; Oram & Perrett, 1992; Tovee, Rolls, Treves, & Bellis, 1993). The model proposed in this article avoids the pitfalls of steepest descent relaxation algorithms by employing prediction weights V and approximate inverse models (as given by W) in the feedforward pathway that allow rapid (though perhaps crude and not always correct) one-shot estimates of static stimuli that are further refined by the state estimation algorithm if necessary.

Inverse models also play a crucial role in the forward-inverse optics model of Kawato, Hayakawa, and Inui (1993) and in the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995; Hinton, Dayan, Frey, & Neal, 1995), both of which share considerable similarities with the model proposed in this article and both of which inspired many of the choices made in the present model. The Kawato et al. (1993) model minimizes the sum of the reconstruction error and a regularizer term incorporating a priori knowledge about the visual world, such as smoothness of representations. This is similar to the MDL optimization function that we use, but without the stochastic perspective. The Helmholtz machine (Dayan et al., 1995) is a hierarchical network that builds a stochastic generative model by using an inverse ("recognition") model to approximate the true posterior distribution of the input data. For modifying the feedback (generative) and feedforward (recognition) weights of a given stochastic Helmholtz machine, Hinton et al. (1995) propose an elegant learning method known as the "wake-sleep" algorithm, which is derived using the MDL principle. While the derivation of the algorithm necessitates some approximations and simplifications, which may sometimes prevent it from converging to the correct posterior
distribution, experimental results using the algorithm have been positive (Hinton et al., 1995). In this regard, the nonlinear form of the model presented in this article also uses an approximation: a first-order Taylor series approximation to the nonlinear activation function f. A possible weakness of the Helmholtz machine, as stated by Dayan et al. (1995), is that recognition in the hierarchical network is a purely bottom-up feedforward process; the generative weights are superfluous and play no role in mediating perception. In addition, there are no lateral interactions between units at a given level.7 The model presented in this article allows for relatively complex dynamic interactions to occur between top-down and bottom-up signals, in addition to some lateral interactions, as given by equation 5.3, among the units at each level. Such top-down–bottom-up interactions have been shown to occur in a wide variety of visual tasks in humans (see Churchland, Ramachandran, & Sejnowski, 1994, for a review). Another issue not directly addressed by many of the models above is that of modeling the dynamic nature of vision and the role of prediction in visual processing (see Softky, 1996, for some arguments regarding the necessity of prediction during perception). The model presented here, on the other hand, is essentially a hierarchical predictor-estimator that continually refines its recognition estimates at each level, allowing the visual system to anticipate incoming stimuli.

7 However, see Dayan and Hinton (1996) for several interesting variants of the Helmholtz machine that address some of these issues.

9.4 Comparison with Related Work on Kalman Filters. During the past decade, Kalman filters have been used in computer vision and robotics for tackling a wide variety of problems ranging from motion estimation to contour tracking (Hallam, 1983; Broida & Chellappa, 1986; Ayache & Faugeras, 1986; Matthies, Kanade, & Szeliski, 1989; Blake & Yuille, 1992; Dickmanns & Mysliwetz, 1992; Pentland, 1992). Much of this work crucially hinges on the ability to formulate accurate dynamic-physical models of the object properties being estimated. The formulation of such hand-coded models, however, becomes increasingly difficult in more complex dynamic environments. A crucial difference between the present approach and the approach predominant in much of the Kalman filter literature is that rather than being concerned solely with the estimation of state for a predefined model, we suggest that Kalman filter–based estimation algorithms may also be used for simultaneously learning the feedforward, feedback, and prediction models ("measurement" and "system" models in Kalman filter terminology) as given by their respective weight matrices. It remains to be seen whether these learned models perform favorably in environments where hand-coded models have proved to be unsatisfactory. Another important difference is the use of a hierarchical form of the Kalman filter. Our formulation in this regard is in many ways similar to the recent proposals of Chou,
Willsky, and Benveniste (1994), Chou, Willsky, and Nikoukhah (1994), and Luettgen and Willsky (1995), who describe an optimal estimation algorithm for a class of multiscale dynamic models. These dynamic models are similar in spirit to the one we described in section 3, albeit without the temporal prediction step. In particular, Chou et al. (1994) exploit the analogy between time and scale, and define scale-recursive linear dynamic models evolving on dyadic trees,

$$r(s) = A(s)\,r(s-1) + B(s)\,w(s), \tag{9.1}$$

$$I(s) = C(s)\,r(s) + v(s), \tag{9.2}$$
where s denotes the current scale on the dyadic tree (highest values denote the leaves), and w and v are zero-mean white noise processes. This formulation leads to an elegant estimation algorithm recursive in scale, consisting of an upward sweep in which information in a subtree is fused in a fine-to-coarse fashion, followed by a downward sweep in which information is spread back through the tree. For the simple linear case without the temporal prediction step, the iterative algorithm presented in this article can be seen to converge to approximately the same solution as that found by the upward-downward sweep algorithm proposed by Chou et al. (1994).8 The equations used by Chou et al. (1994) above can be seen to be closely related to the following two equations (when defined on a dyadic tree) that formed part of the more general stochastic model described in section 3:

$$r(s, t) = r_{td}(s, t) + n_{td}(s, t), \tag{9.3}$$

$$I(s, t) = U(s, t)\,r(s, t) + n_{bu}(s, t), \tag{9.4}$$

8 We thank Peter Dayan for pointing out this equivalence.
where we may use r_td(s, t) = U_i(s − 1, t) r(s − 1, t) if r_td happens to be information from the next higher level and U_i is the part of the higher-level weight matrix U that generates inputs for the given lower-level module (in Fig. 2, for example, the matrix U_i for the first-level module corresponds to the top one-fourth rows of the second-level matrix U as depicted in the figure). The correspondence between the various terms in the two sets of equations (9.1 and 9.3, 9.2 and 9.4) is quite striking. However, a few important differences are also evident. While the linear dynamic system model in Chou et al.'s (1994) framework is used only to relate transitions from one scale to the next, the model used in this article can be seen to be recursive in both scale and time. The recursion in scale is, however, implicit since information from another scale is handled in a modular fashion simply as input from a separate source (as in equations 3.5 and 9.3). The corresponding estimation algorithm is thus recursive in time as in the standard Kalman filter and allows for prediction in the temporal domain at each hierarchical
level. As stressed previously, the ability to predict is crucial for perception and action. Being strictly recursive in scale, the Chou et al. (1994) algorithm requires that covariance matrix information be propagated across hierarchical levels. In contrast, the model presented here maintains covariance information locally at each level. In a biological setting, the computation of covariance information may be a limiting step and is perhaps best kept local. A further advantage of our formulation is that it lends itself naturally to arbitrary varieties of modularization by allowing estimates of a given state variable from disparate sources to be integrated according to their signal-to-noise ratios (this article illustrates the example of top-down and bottom-up sources). A direct consequence of such a formulation is that one can model arbitrary connections among different (but related) cortical areas that do not necessarily have to be related in a strictly multiscale, hierarchical fashion as required by the framework of Chou et al. (1994). A final but important difference is the prominent role accorded to learning in our approach, a subject not addressed by Chou et al. (1994).
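A minimal sketch of this kind of integration for two gaussian estimates of the same state, weighted by their inverse covariances (all shapes and values are hypothetical stand-ins, not quantities from the article):

```python
import numpy as np

# Minimal sketch, not from the article: fusing a top-down estimate r_td
# (covariance S_td) and a bottom-up estimate r_bu (covariance S_bu) of the
# same state by inverse-covariance (precision) weighting, i.e., according
# to their signal-to-noise ratios.
k = 3
r_td = np.array([1.0, 0.5, -0.2])        # stand-in top-down estimate
S_td = np.eye(k) * 0.5                   # noisier source
r_bu = np.array([0.8, 0.7, 0.0])         # stand-in bottom-up estimate
S_bu = np.eye(k) * 0.1                   # more reliable source

P = np.linalg.inv(np.linalg.inv(S_td) + np.linalg.inv(S_bu))        # fused covariance
r = P @ (np.linalg.inv(S_td) @ r_td + np.linalg.inv(S_bu) @ r_bu)   # fused mean
```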
9.5 Comparison with Related Work on Learning and Parameter Estimation. The subject of learning synaptic weights that resemble receptive field weighting profiles in the visual cortex has received considerable attention within the neural modeling community. Early work showed that receptive fields resembling those of simple cells in the striate cortex could be learned from natural images using algorithms ranging from competitive learning (Barrow, 1987) to principal component analysis (Hancock, Baddeley, & Smith, 1992). However, these fell short of providing a complete characterization due to limitations such as a proclivity toward producing global and/or orthogonal receptive fields. More recently, algorithms that produce better characterizations of cortical receptive fields have been proposed independently by Olshausen and Field (1996) and Harpur and Prager (1996) (see also Bell & Sejnowski, 1996), building on previous work by Daugman (1988) and Pece (1992). As alluded to in section 2, these algorithms are closely related to the learning rule expressed in equation 6.4. In particular, the "sparseness" function of Olshausen and Field (1996) corresponds to the MDL model prior term |L(r)| in equation 4.6. Using different encoding distributions gives rise to different degrees of sparseness and localization in the internal representations. For example, Olshausen and Field (1996) show how a variety of nonlinear sparseness functions can be used to generate localized receptive fields.9

9 However, Baddeley (1996) presents some arguments against sparseness maximization.

In our case, the gaussian distribution assumed when evaluating |L(r)| leads to a linear activity penalty that favors distributed codes; localization in our model is provided by the already existing cortical topology of hierarchical, overlapping receptive fields, whose size at
each level is predetermined by the local dendritic field. Since the learning algorithms of Harpur and Prager (1996) and Olshausen and Field (1996) are based on gradient descent (as described in section 2), they require the specification of a schedule for adapting the learning rate parameter, whereas this adaptation occurs naturally in equation 6.4 due to the gradual reduction in the noise covariance as more inputs become available. More important, the learning rule expressed by equation 6.4, as well as those proposed for the other weights in the model, takes into account not only bottom-up information but also top-down expectations during weight modification. This allows for the possibility of developing top-down guided task-relevant internal representations.10

10 A hierarchical learning method for vector quantization in the presence of noise is presented in Luttrell (1992). This method allows single-cause models, whereas the present model can encode multiple-cause models (Dayan & Zemel, 1995; Saund, 1995; see also Luttrell, 1995).

The learning method presented in this article is closely related to mean field methods for Bayesian belief networks (Saul, Jaakkola, & Jordan, 1996; Jaakkola, Saul, & Jordan, 1996), which use mean field equations for computing lower bounds on the log likelihood–based cost function. The network parameters (the weights and biases) are subsequently learned by performing gradient ascent on the mean field derived bound for the log likelihood. In comparison, the learning scheme presented in this article can be regarded essentially as employing an on-line form of the EM algorithm (Dempster et al., 1977) in which the M-step performs maximum a posteriori (MAP) estimation rather than maximum likelihood (ML) estimation. Note that ML estimation is simply a special case of MAP estimation in which there is no prior information (this corresponds to the case where no predictions are computed from prior data as, for example, in section 2). In particular, the on-line MAP form of the EM algorithm is invoked in three different ways in the present framework (a schematic sketch of this alternation appears at the end of this section):

1. Estimation of the state r. In this case, the "hidden" variables (or "missing" data) consist of the weights U and V while we treat the state vector r as the parameter to be estimated. Thus, the E-step involves computing the gaussian distributions given by (ū, S) and (v̄, T). The M-step then uses these distributions from the E-step to compute r̂ and P, thereby maximizing the posterior probability of the parameter vector r (by minimizing the MDL-based cost function of section 4). Note, however, that in the case where the activation function f is nonlinear, the maximization in the M-step is not exact due to the Taylor series approximation to f used in the various update equations (the same applies to the other M-steps below).

2. Estimation of the generative weights u. Here, the hidden variables consist of the state vector r and the weights V, while the parameter to
be estimated is U. The E-step thus involves computing the distributions given by (r̄, M) (or, when feasible, (r̂(t|N), P(t|N)); see below) and (v̄, T), which allow the calculation of û and P_u in the M-step, as discussed in section 6.

3. Estimation of the prediction weights v. In this case, the hidden variables comprise the state vector r and the weights U, and the parameter to be estimated is the state transition matrix V. Since V implements a Markov chain relating the states r in time (see equation 3.7), the E-step in this case becomes slightly more complicated than in the two cases above since the optimal estimate of the state r is no longer causal but depends temporally on future as well as past and current data values. In the case of hidden Markov models (HMMs) (Rabiner & Juang, 1986) (which are analogous to Kalman filters with discrete hidden states), this impasse is solved by using the familiar forward-backward procedure within the context of the Baum-Welch algorithm (Baum et al., 1970) (the Baum-Welch algorithm is, incidentally, a special case of the EM algorithm). In our case, we may use a form of optimal smoothing (Bryson & Ho, 1975), which, when given the batch input data for time t = 1, . . . , N, optimally assimilates past, current, and future data into temporally smoothed state estimates r̂(t|N) (with covariance P(t|N)) for each of the time instants t = 1, . . . , N. This, along with the computation of (ū, S), comprises the E-step. The M-step then involves computing v̂ and P_v as described in the appendix.

It may be desirable in many cases to implement the various E- and M-steps above for each set of parameters in an asynchronous manner rather than in strict alternation. For example, the cortex may choose to update its estimates for the state r much more frequently than those for U and V in order to act in real time. The updates for U and V could then be carried out either online but at a slower rate than r, or alternately when the organism is in a rest phase. The latter suggestion is especially attractive in the case of learning V given that the batch processing of data required by the optimal smoothing step suggested above may not be appropriate in many circumstances that demand real-time responses from the organism. In such cases, it might be beneficial for the organism to act based on its current online state estimates while simultaneously storing behaviorally relevant input sequences in memory as batch data. Such stored data may then be replayed later, when the organism is in a more quiescent phase, in order to learn or fine-tune the prediction weights V. This suggests a possible computational role for dream sleep in facilitating learning and memory.
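The promised schematic of this asynchronous alternation follows; every class name, method name, and update rate below is a hypothetical placeholder, not the article's implementation:

```python
import numpy as np

# Schematic sketch of the asynchronous on-line MAP-EM alternation described
# above: the state r is updated at every time step, the generative weights U
# more slowly, and the prediction weights V from replayed batch data during
# a "rest phase". All names and rates are hypothetical placeholders.

class DummyModel:
    def update_state(self, I): ...                      # fast E/M steps for r
    def update_generative_weights(self): ...            # slower E/M steps for U
    def smooth_states(self, batch): return batch        # stand-in for r_hat(t|N)
    def update_prediction_weights(self, smoothed): ...  # M-step for V

def run(inputs, model, weight_period=100):
    replay = []                                # behaviorally relevant sequences
    for t, I in enumerate(inputs):
        model.update_state(I)                  # every step, to act in real time
        replay.append(I)
        if t % weight_period == 0:
            model.update_generative_weights()  # slower rate than r
    # "Rest phase": batch-smooth stored data, then fit the prediction weights.
    smoothed = model.smooth_states(replay)
    model.update_prediction_weights(smoothed)

run([np.zeros(4) for _ in range(500)], DummyModel())
```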
10 Summary and Conclusion

Vision is fundamentally a dynamic process. Any putative model of the visual cortex must therefore acknowledge this fact and allow the modeling of dynamic processes. This article suggests that the hierarchical structure of the visual cortex is well suited to implement a multiscale estimation algorithm that can be regarded as a hierarchical form of the extended Kalman filter (Maybeck, 1979). At each hierarchical level, the algorithm recursively estimates the current recognition state by combining "measurement" information from top-down and bottom-up cortical modules with the prediction that was made by using a "system" model. The "system" matrix (the prediction weights) as well as the "measurement" matrices (feedforward and feedback weights) can be learned from visual experiences in a dynamic environment. Thus, the adaptive processes in the cortex are modeled at two different time scales: a fast dynamic state estimation process that allows the visual system to anticipate incoming stimuli and a slower Hebbian synaptic weight change mechanism. Both processes are characterized by their respective mean and covariance estimates, which are continually updated.

Some of the salient features of the model as a recognition system were illustrated by training it on a small number of man-made objects and testing recognition performance. The model was shown to be robust to partial occlusions and confusing stimuli. An ongoing effort involves a more quantitative evaluation of the model's recognition performance using realistic objects in natural scenes and the validation of the model in the domain of human visual search. Preliminary results along these two directions have been promising (Rao & Ballard, 1995a; Rao, Zelinsky, Hayhoe, & Ballard, 1996). When trained on natural images, the model developed feedforward receptive fields qualitatively resembling those found at the early hierarchical levels of the primate visual cortex. Current simulations involve applying the learning algorithms to successively higher levels of the network and comparing the receptive field properties that develop to those found at higher cortical levels such as IT.

The hierarchical model also provides a computational platform for understanding a number of neurophysiological-psychophysical phenomena, such as illusory contours, amodal completions, visual imagery, bistable percepts, end stopping, and other effects that occur due to stimuli from beyond the classical receptive field. In particular, we provided simulations of the model suggesting that some of the recent results of Gallant et al. (1994, 1995; Gallant, 1996) regarding neural response suppression during free viewing of natural scenes may be interpreted as occurring due to predictions from higher levels of the cortex.

While the present framework illustrates how information from two different sources, one top down and the other bottom up, may be integrated in a given cortical area, it is not hard to envision extensions of the model where more than two areas feed into a single area, as is evident in the visual cortex (Felleman & Van Essen, 1991). This would entail the relatively straightforward exercise of adding additional error/model terms to the optimization equations 4.5 and 4.6 to account for inputs from the additional areas before deriving the estimation algorithms.
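A minimal one-level, linear sketch of this predict-correct cycle is given below; all matrices, dimensions, and noise values are hypothetical stand-ins rather than the trained model's quantities:

```python
import numpy as np

# Minimal one-level, linear predict/correct sketch of the estimation cycle
# described above. All matrices and dimensions are hypothetical stand-ins.
rng = np.random.default_rng(1)
k, n = 4, 16
U = rng.standard_normal((n, k)) * 0.1    # "measurement" (generative) weights
V = np.eye(k) * 0.95                     # "system" (prediction) weights
Sigma = np.eye(k) * 0.01                 # system noise covariance
N_cov = np.eye(n) * 0.1                  # measurement noise covariance

r_hat = np.zeros(k)
P = np.eye(k)

for I in rng.standard_normal((50, n)):   # stand-in input stream
    # Predict the next state from the system model.
    r_bar = V @ r_hat
    M = V @ P @ V.T + Sigma
    # Correct with the bottom-up measurement via the Kalman gain.
    K = M @ U.T @ np.linalg.inv(U @ M @ U.T + N_cov)
    r_hat = r_bar + K @ (I - U @ r_bar)
    P = (np.eye(k) - K @ U) @ M
```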
A reassuring feature of the model is that it does not explicitly depend on the input signals being visual. Thus, given that the neocortex is organized along similar laminar input-output principles regardless of cortical area or input modality (Creutzfeldt, 1977), it is reasonable to assume that the general framework proposed here may be uniformly applicable to other cortical areas, such as the parietal, auditory, or motor cortex, as well.11 We expect the results obtained using the current model to play a crucial role in guiding our future cortical modeling efforts.

11 We have recently shown (Rao & Ballard, 1996) how the model presented in this article can be extended to handle transformations in the image plane. The resulting estimation scheme parallels the functional dichotomy between the dorsal (occipitoparietal) and ventral (occipitotemporal) pathways in the primate visual cortex (Felleman & Van Essen, 1991). Support for the possibility of modeling the motor cortex using Kalman filter-like mechanisms comes from the work of Wolpert, Ghahramani, and Jordan (1995), who provide strong psychophysical evidence for the existence of such mechanisms in the context of sensorimotor integration.

Appendix: Learning the Prediction Weights

While the simulations in the article used static stimuli and therefore did not make use of the prediction step, it is almost certain that such predictions play an important role in the processing of time-varying stimuli. Therefore, we suggest a possible learning procedure for modifying the synaptic weights v of the (possibly nonlinear) prediction units during exposure to time-varying stimuli. The derivation is similar to that for the feedback weights u. Given the current vector of responses r̂ at time t, define the k × k² matrix R̂ to be

$$\hat{R} = \begin{pmatrix} \hat{r}^T & 0 & \cdots & 0 \\ 0 & \hat{r}^T & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{r}^T \end{pmatrix}. \tag{A.1}$$
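As a quick concreteness check (hypothetical dimensions; this is not code from the article), R̂ can be built as a Kronecker product and satisfies the identity R̂v = Vr̂ underlying equation A.2:

```python
import numpy as np

# Minimal sketch (hypothetical dimensions, not from the article): build the
# k x k^2 matrix R_hat of equation A.1 and check R_hat @ v == V @ r_hat,
# where v is the row-wise flattening of the k x k prediction matrix V.
k = 4
r_hat = np.arange(1.0, k + 1.0)          # stand-in response vector
R_hat = np.kron(np.eye(k), r_hat)        # row i holds r_hat^T in block i

V = np.random.default_rng(0).standard_normal((k, k))
v = V.reshape(-1)                        # flattened prediction weights
assert np.allclose(R_hat @ v, V @ r_hat)
```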
Notice that the prediction step can be stated as

$$r(t+1) = f(V(t)\,\hat{r}(t)) + n(t) = f(\hat{R}(t)\,v(t)) + n(t). \tag{A.2}$$
Differentiating the cost function J with respect to the vector of prediction weights v and setting the result to zero, we obtain, using equation A.2 for r(t + 1),

$$-\left(\frac{\partial f}{\partial v}\right)^{T} M^{-1}\left(r(t+1) - f(\hat{R}(t)\hat{v}(t)) - \bar{n}(t)\right) + T^{-1}\left(\hat{v}(t) - \bar{v}(t)\right) + \beta\,\hat{v}(t) = 0, \tag{A.3}$$
which yields the update rule (for time t),

$$\hat{v}(t) = \bar{v}(t) + P_v \left(\frac{\partial f}{\partial v}\right)^{T} M^{-1}\left(r(t+1) - f(\hat{R}(t)\bar{v}(t)) - \bar{n}(t)\right) - \beta P_v\,\bar{v}(t), \tag{A.4}$$
or more simply,

$$\hat{v}(t) = \bar{v}(t) + P_v \left(\frac{\partial f}{\partial v}\right)^{T} M^{-1}\left(r(t+1) - \bar{r}(t+1)\right) - \beta P_v\,\bar{v}(t), \tag{A.5}$$
where v̄(t + 1) = v̂(t) + n̄_v(t) and r̄(t + 1) = f(R̂(t)v̄(t)) + n̄(t). The corresponding covariance matrix is updated at time t as follows,

$$P_v(t) = \left(T(t)^{-1} + \left(\frac{\partial f}{\partial v}\right)^{T} M(t)^{-1}\,\frac{\partial f}{\partial v} + \beta I\right)^{-1}, \tag{A.6}$$
where I is the k² × k² identity matrix. The partial derivatives are evaluated at v(t) = v̄(t). Note that in order to obtain a closed-form expression for updating v̂ and P_v, we once again applied a first-order Taylor series approximation to f around the current estimate v̄(t). The covariance matrix T gets updated according to

$$T(t+1) = P_v(t) + \Sigma_v(t). \tag{A.7}$$
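The following minimal sketch implements equations A.5 through A.7 for the linear case f(x) = x, where ∂f/∂v reduces to R̂; all dimensions, priors, and noise values are hypothetical stand-ins, not the article's implementation:

```python
import numpy as np

# Minimal sketch of update rules A.5-A.7 for the linear case f(x) = x,
# where df/dv reduces to R_hat. All values are hypothetical stand-ins.
rng = np.random.default_rng(2)
k = 4
beta = 0.01
M = np.eye(k) * 0.1                       # state noise covariance
Sigma_v = np.eye(k * k) * 1e-4            # random-walk covariance for v

v_bar = np.zeros(k * k)                   # prior mean of flattened V
T = np.eye(k * k)                         # its covariance
r_hat = rng.standard_normal(k)            # current state estimate
r_next = rng.standard_normal(k)           # stand-in for r(t+1) or r_hat(t+1|N)

R_hat = np.kron(np.eye(k), r_hat)         # equation A.1
r_bar_next = R_hat @ v_bar                # predicted next state (n_bar = 0 here)

# Equation A.6: posterior covariance of v.
P_v = np.linalg.inv(np.linalg.inv(T) + R_hat.T @ np.linalg.inv(M) @ R_hat
                    + beta * np.eye(k * k))
# Equation A.5: posterior mean of v (the prediction-error-driven update).
v_hat = (v_bar + P_v @ R_hat.T @ np.linalg.inv(M) @ (r_next - r_bar_next)
         - beta * P_v @ v_bar)
# Equation A.7: propagate covariance to the next time step.
T = P_v + Sigma_v
```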
By appealing to a version of the EM algorithm (see section 9.5), we may use r(t + 1) = r̂(t + 1|N) in equation A.5 above, where r̂(t + 1|N) is the optimal temporally smoothed state estimate (Bryson & Ho, 1975) for time t + 1 (≤ N), given input data for each of the time instants 1, . . . , N (see the discussion in section 9.5). In certain circumstances, it might be possible to use the online estimate r̂(t + 1) in place of its computationally more expensive smoothed counterpart r̂(t + 1|N), though such an approximation has been known to yield negative results in other related network architectures (Peter Dayan, pers. comm.). Simulations are currently underway to evaluate the effectiveness of the above learning rule and the role of temporally smoothing the state estimates. Intuitively, equation A.5 may be regarded as a form of quasi-Hebbian adaptation involving the presynaptic activity ∂f/∂v, which is simply the
response R̂(t) at time t for linear neurons, and the postsynaptic activity (r̂(t + 1|N) − r̄(t + 1)), which indicates the error in prediction. Computing the difference in the prediction error term requires the presence of some form of inhibitory interneurons. Inhibitory interneurons are also required by some of the other estimation algorithms derived in this article. Interestingly, a wide variety of such interneurons have been discovered in the primate visual cortex (Jones, 1981).

Acknowledgments

We are grateful to Peter Dayan and the two anonymous reviewers for their constructive comments and suggestions that immensely helped in improving the quality of the article. We are particularly indebted to Peter Dayan for extensive criticism of an earlier version, which inspired the more general formulation of the approach as described here. Thanks are also due to Jack Gallant for many interesting conversations, for reading drafts of the article, and for the large number of suggestions and clarifications he provided during the course of this work. We also thank Helder Araujo, Jessica Bayliss, Chris Brown, Avis Cohen, Mary Hayhoe, Kiriakos Kutulakos, Peter Lennie, Bill Merigan, Robbie Jacobs, Gerhard Sagerer, David Williams, and Gregory Zelinsky for pointers, comments, and discussions regarding the article. Some of the images used in the simulations were taken from the Columbia object database, courtesy of Shree Nayar. This research was supported by NIH/PHS research grants 1-P41-RR09283 and 1-R24-RR06853-02, and by NSF research grants IRI-9406481 and IRI-8903582.

References

Allman, J., Miezin, F., & McGuinness, E. (1985). Stimulus specific responses from beyond the classical receptive field: Neurophysiological mechanisms for local-global comparisons in visual neurons. Annual Review of Neuroscience, 8, 407–429. Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location by posterior parietal neurons. Science, 230, 456–458. Anderson, C. H., & Van Essen, D. C. (1987). Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proceedings of the National Academy of Sciences, 84, 6297–6301. Ayache, N., & Faugeras, O. D. (1986). HYPER: A new approach for the recognition and positioning of two-dimensional objects. IEEE Trans. Pattern Analysis and Machine Intelligence, 8(1), 44–54. Baddeley, R. (1996). Searching for filters with "interesting" output distributions: An uninteresting direction to explore? Network, 7(2), 409–421. Baizer, J. S., Robinson, D. L., & Dow, B. M. (1977). Visual responses of area 18 neurons in awake, behaving monkey. Journal of Neurophysiology, 40, 1024–1037.
Ballard, D. H., Hayhoe, M. M., Pook, P. K., & Rao, R. P. N. (in press). Deictic codes for the embodiment of cognition. Behavioral and Brain Sciences. Barrow, H. G. (1987). Learning receptive fields. In Proceedings of the IEEE Int. Conf. on Neural Networks, pp. 115–121. Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171. Bell, A. J., & Sejnowski, T. J. (in press). The “independent components” of natural scenes are edge filters. Submitted to Vision Research, 1996. Blake, A., & Yuille, A. (Eds.). (1992). Active vision. Cambridge, MA: MIT Press. Bolz, J., & Gilbert, C. D. (1986). Generation of end-inhibition in the visual cortex via interlaminar connections. Nature, 320, 362–365. Broida, T. J., & Chellappa, R. (1986). Estimation of object motion parameters from noisy images. IEEE Trans. Pattern Analysis and Machine Intelligence, 8(1), 90–99. Bryson, A. E., & Ho, Y.-C. (1975). Applied optimal control. New York: Wiley. Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115. Chatfield, C., & Collins, A. J. (1980). Introduction to multivariate analysis. New York: Chapman and Hall. Chou, K. C., Willsky, A. S., & Benveniste, A. (1994). Multiscale recursive estimation, data fusion, and regularization. IEEE Transactions on Automatic Control, 39(3), 464–478. Chou, K. C., Willsky, A. S., & Nikoukhah, R. (1994). Multiscale systems, Kalman filters, and Riccati equations. IEEE Transactions on Automatic Control, 39(3), 479–492. Churchland, P., & Sejnowski, T. (1992). The computational brain. Cambridge, MA: MIT Press. Churchland, P. S., Ramachandran, V. S., & Sejnowski, T. J. (1994). A critique of pure vision. In C. Koch and J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 23–60). Cambridge, MA: MIT Press. Creutzfeldt, O. D. (1977). Generality of the functional structure of the neocortex. Naturwissenschaften, 64, 507–517. Daugman, J. G. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Research, 20, 847–856. Daugman, J. G. (1988). Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Trans. Acoustics, Speech, and Signal Proc., 36(7), 1169–1179. Dayan, P., & Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9(8), 1385–1403. Dayan, P., & Zemel, R. S. (1995). Competition and multiple cause models. Neural Computation, 7, 565–579. Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Deacon, T. (1989). Holism and associationism in neuropsychology: An anatomical synthesis. In E. Perecman (Ed.), Integrating Theory and Practice in Clinical Neuropsychology (pp. 1–47). Hillsdale, NJ: Erlbaum. DeAngelis, G. C., Robson, J. G., Ohzawa, I., & Freeman, R. D. (1992). Organization of suppression in receptive fields of neurons in cat visual cortex. Journal of Neurophysiology, 68(1), 144–163. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38. Desimone, R., & Schein, S. J. (1987). Visual properties of neurons in area V4 of the macaque: sensitivity to stimulus form. Journal of Neurophysiology, 57, 835–868. Desimone, R., & Ungerleider, L. G. (1989). Neural mechanisms of visual processing in monkeys. In F. Boller & J. Grafman (Eds.), Handbook of Neuropsychology (Vol. 2, pp. 267–299). New York: Elsevier. Dickmanns, E. D., & Mysliwetz, B. D. (1992). Recursive 3D road and relative ego-state recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 14(2), 199–213. Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1989). A canonical microcircuit for neocortex. Neural Computation, 1, 480–488. Duhamel, J., Colby, L., & Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255, 90–92. Edelman, G. M. (1978). Group selection and phasic re-entrant signaling: A theory of higher brain function. In V. B. Mountcastle & G. M. Edelman (Eds.), The Mindful Brain (pp. 51–100). Cambridge, MA: MIT Press. Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47. Feller, W. (1968). An introduction to probability theory and its applications (Vol. 1). New York: Wiley. Fukushima, K. (1988). A neural network for visual pattern recognition. Computer, 21(3), 65–75. Gallant, J. L. (1996). Cortical responses to natural scenes under controlled and free viewing conditions. Invest. Ophthalmol. Vis. Sci., 37(3), 674. Gallant, J. L., Connor, C. E., & Van Essen, D. C. (1994). Responses of visual cortical neurons in a monkey freely viewing natural scenes. Soc. Neuro. Abstr., 20, 838. Gallant, J. L., Connor, C. E., Drury, W., & Van Essen, D. C. (1995). Neural responses in monkey visual cortex during free viewing of natural scenes: Mechanisms of response suppression. Invest. Ophthalmol. Vis. Sci., 36, 1052. Gilbert, C. D., & Wiesel, T. N. (1981). Laminar specialization and intracortical connections in cat primary visual cortex. In F. O. Schmitt, F. G. Worden, G. Adelman, & S. G. Dennis (Eds.), The Organization of the Cerebral Cortex (pp. 163–191). Cambridge, MA: MIT Press. Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2), 219–269. Grenander, U. (1976–1981). Lectures in pattern theory I, II and III: Pattern analysis, pattern synthesis and regular structures. Berlin: Springer-Verlag.
Grimes, J., & McConkie, G. (1995). On the insensitivity of the human visual system to image changes made during saccades. In K. Akins (Ed.), Problems in Perception. Oxford: Oxford University Press. Grosof, D. H., Shapley, R. M., & Hawken, M. J. (1992). Macaque striate responses to anomalous contours? Invest. Ophthalmol. Vis. Sci., 33, 1257. Gulyas, B., Orban, G. A., Duysens, J., & Maes, H. (1987). The suppressive influence of moving textured backgrounds on responses of cat striate neurons to moving bars. Journal of Neurophysiology, 57(6), 1767–1791. Hallam, J. (1983). Resolving observer motion by object tracking. In Proc. of 8th International Joint Conf. on Artificial Intelligence (Vol. 2, pp. 792–798). Hancock, P. J. B., Baddeley, R. J., & Smith, L. S. (1992). The principal components of natural images. Network, 3, 61–70. Harpur, G. F., & Prager, R. W. (1996). Development of low-entropy coding in a recurrent network. Network, 7(2), 277–284. Harth, E., Unnikrishnan, K. P., & Pandya, A. S. (1987). The inversion of sensory processing by feedback pathways: A model of visual cognitive functions. Science, 237, 184–187. Heeger, D. J., Simoncelli, E. P., & Movshon, J. A. (1996). Computational models of cortical visual processing. Proc. National Acad. Sciences, 93, 623–627. Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161. Hubel, D. H. (1988). Eye, brain, and vision. New York: Scientific American Library. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex. Journal of Physiology (London), 160, 106–154. Hubel, D. H., & Wiesel, T. N. (1965). Receptive fields and functional architecture in two nonstriate visual areas (18 and 19) of the cat. Journal of Neurophysiology, 28, 229–289. Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology (London), 195, 215–243. Irwin, D. E., McConkie, G. W., Carlson-Radvansky, L., & Currie, C. (1994). A localist evaluation solution for visual stability across saccades. Behavioral and Brain Sciences, 17, 265–266. Jaakkola, T., Saul, L. K., & Jordan, M. I. (1996). Fast learning by bounding likelihoods in sigmoid type belief networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 528–534). Cambridge, MA: MIT Press. Jones, E. G. (1981). Anatomy of cerebral cortex: Columnar input-output organization. In F. O. Schmitt, F. G. Worden, G. Adelman, & S. G. Dennis (Eds.), The Organization of the Cerebral Cortex (pp. 199–235). Cambridge, MA: MIT Press. Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354. Kalman, R. E. (1960). A new approach to linear filtering and prediction theory. Trans. ASME J. Basic Eng., 82, 35–45. Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Trans. ASME J. Basic Eng., 83, 95–108.
Kanizsa, G. (1990). Subjective contours. In I. Rock (Ed.), The Perceptual World: Readings from Scientific American Magazine (pp. 155–163). San Francisco: Freeman. Kawato, M., Hayakawa, H., & Inui, T. (1993). A forward-inverse optics model of reciprocal connections between visual cortical areas. Network, 4, 415–422. Koch, C. (1996). Towards the neuronal substrate of visual consciousness. In S. R. Hameroff, A. W. Kaszniak, & A. C. Scott (Eds.), Towards a Science of Consciousness: The First Tucson Discussions and Debates. Cambridge, MA: MIT Press. Kosslyn, S. M., Thompson, W. I., Kim, I. J., & Alpert, N. M. (1995). Topographical representations of mental images in primary visual cortex. Nature, 378, 496– 498. Li, M., & Vitanyi, P. (1993). An introduction to Kolmogorov complexity and its applications. New York: Springer-Verlag. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117. Luettgen, M. R., & Willsky, A. S. (1995). Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination. IEEE Transactions on Image Processing, 4(2), 194–207. Lund, J. S. (1981). Intrinsic organization of the primate visual cortex, area 17, as seen in Golgi preparations. In F. O. Schmitt, F. G. Worden, G. Adelman, & S. G. Dennis (Eds.), The Organization of the Cerebral Cortex (pp. 105–124). Cambridge, MA: MIT Press. Luttrell, S. P. (1992). Self-supervised adaptive networks. IEE Proceedings Part F, 139, 371–377. Luttrell, S. P. (1995). A componential self-organising neural network (Tech. Rep. No. DRA/CIS(SE1)/651/11/RP/3.1). Malvern, UK: Defence Research Agency. MacKay, D. M. (1956). The epistemological problem for automata. In C. E. Shannon & J. McCarthy (Eds.), Automata Studies (pp. 235–251). Princeton, NJ: Princeton University Press. Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. Journal of the Optical Society of America, 70, 1297–1300. Matthies, L., Kanade, T., & Szeliski, R. (1989). Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3, 209–236. Maunsell, J. H. R., & Newsome, W. T. (1987). Visual processing in monkey extrastriate cortex. Annual Review of Neuroscience, 10, 363–401. Maybeck, P. S. (1979). Stochastic models, estimation, and control (Vols. 1, 2). New York: Academic Press. McConkie, G. W. (1991). Perceiving a stable visual world. In Proceedings of the Sixth European Conference on Eye Movements (pp. 5–7). Leuven, Belgium: Laboratory of Experimental Psychology. Mignard, M., & Malpeli, J. G. (1991). Paths of information flow through visual cortex. Science, 251, 1249–1251. Miller, E. K., Li, L., & Desimone, R. (1991). A neural mechanism for working and recognition memory in inferior temporal cortex. Science, 254, 1377–1379.
Muller, J. R., Singer, B., Krauskopf, J., & Lennie, P. (1996). Center-surround contrast effects in macaque cortex. Invest. Ophthalmol. Vis. Sci., 37(3), 904. Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 125–152). Cambridge, MA: MIT Press. Murphy, P. C., & Sillito, A. M. (1987). Corticofugal feedback influences the generation of length tuning in the visual pathway. Nature, 329, 727–729. Nelson, J. I., & Frost, B. J. (1978). Orientation-selective inhibition from beyond the classic visual receptive field. Brain Research, 139, 359–365. Nowlan, S. J., & Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4), 473–493. Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. Oram, M. W., & Perrett, D. I. (1992). Time course of neural responses discriminating different views of the face and head. Journal of Neurophysiology, 68(1), 70–84. Palmer, L. A., Jones, J. P., & Stepnoski, R. A. (1991). Striate receptive fields as linear filters: Characterization in two dimensions of space. In A. G. Leventhal (Ed.), The Neural Basis of Visual Function (pp. 246–265). Boca Raton: CRC Press. Pece, A. E. C. (1992). Redundancy reduction of a Gabor representation: A possible computational role for feedback from primary visual cortex to lateral geniculate nucleus. In I. Aleksander & J. Taylor (Eds.), Artificial Neural Networks 2 (pp. 865–868). Amsterdam: Elsevier Science. Pentland, A. P. (1992). Dynamic vision. In G. A. Carpenter & S. Grossberg (Eds.), Neural Networks for Vision and Image Processing (pp. 133–159). Cambridge, MA: MIT Press. Pitts, W., & McCulloch, W. S. (1947). How we know universals: The perception of auditory and visual forms. Bulletin of Mathematical Biophysics, 9, 127–147. Poggio, T. (1990). A theory of how the brain might work. In Cold Spring Harbor Symposia on Quantitative Biology (pp. 899–910). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Poggio, T., Fahle, M., & Edelman, S. (1992). Fast perceptual learning in visual hyperacuity. Science, 256, 1018–1021. Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319. Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE ASSP Magazine, 3, 4–16. Rao, R. P. N. (1996). Robust Kalman filters for prediction, recognition, and learning (Tech. Rep. No. 645). Rochester, NY: Dept. of Computer Science, University of Rochester. Rao, R. P. N., & Ballard, D. H. (1995a). An active vision architecture based on iconic representations. Artificial Intelligence (Special Issue on Vision), 78, 461–505. Rao, R. P. N., & Ballard, D. H. (1995b). Dynamic model of visual memory predicts neural response properties in the visual cortex (Tech. Rep. No. 95.4). Rochester,
NY: National Resource Laboratory for the Study of Brain and Behavior, Department of Computer Science, University of Rochester. Rao, R. P. N., & Ballard, D. H. (1996). A class of stochastic models for invariant recognition, motion, and stereo (Tech. Rep. No. 96.1). Rochester, NY: National Resource Laboratory for the Study of Brain and Behavior, Department of Computer Science, University of Rochester. Rao, R. P. N., Zelinsky, G. J., Hayhoe, M. M., & Ballard, D. W. (1996). Modeling saccadic targeting in visual search. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 830–836). Cambridge, MA: MIT Press. Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific. Rockland, R. S., & Pandya, D. N. (1979). Laminar origins and terminations of cortical connections of the occipital lobe in the rhesus monkey. Brain Research, 179, 3–20. Rolls, E. (1989). The representation and storage of information in neuronal networks in the primate cerebral cortex and hippocampus. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The Computing Neuron (pp. 125–159). Reading, MA: Addison-Wesley. Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76. Saund, E. (1995). A multiple cause mixture model for unsupervised learning. Neural Computation, 7, 51–71. Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656. Softky, W. R. (1996). Unsupervised pixel-prediction. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8 (pp. 809–815). Cambridge, MA: MIT Press. Thorpe, S. J., & Imbert, M. (1989). Biological constraints on connectionist modelling. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, & L. Steels (Eds.), Connectionism in Perspective (pp. 63–92). Amsterdam: Elsevier. Tovee, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654. Ullman, S. (1994). Sequence seeking and counterstreams: A model for bidirectional information flow in the cortex. In C. Koch & J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 257–270). Cambridge, MA: MIT Press. Van Essen, D. C. (1985). Functional organization of primate visual cortex. In A. Peters & E. G. Jones (Eds.), Cerebral Cortex (Vol. 3, pp. 259–329). New York: Plenum. Van Essen, D. C., Anderson, C. H., & Olshausen, B. A. (1994). Dynamic routing strategies in sensory, motor, and cognitive processing. In C. Koch & J. L. Davis (Eds.), Large-Scale Neuronal Theories of the Brain (pp. 271–299). Cambridge, MA: MIT Press. Van Essen, D. C., & Maunsell, J. H. R. (1983). Hierarchical organization and functional streams in the visual cortex. Trends in Neuroscience, 6, 370–375.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262. Williams, R. J. (1985). Feature discovery through error-correction learning (Tech. Rep. No. 8501). Institute for Cognitive Science, University of California at San Diego. Wolpert, D. M., Ghahramani, Z., & Jordan, M. I. (1995). An internal model for sensorimotor integration. Science, 269, 1880–1882. Young, R. A. (1985). The gaussian derivative theory of spatial vision: Analysis of cortical cell receptive field line-weighting profiles (Research Publication GMR4920). Warren, MI: General Motors. Zeki, S. M. (1976). The functional organization of projections from striate to prestriate visual cortex in the rhesus monkey. Cold Spring Harbor Symp. Quant. Biology, 40, 591–600. Zeki, S. M. (1978). Uniformity and diversity of structure and function in rhesus monkey prestriate visual cortex. Journal of Physiology (London), 277, 273–290. Zeki, S. M. (1983). Colour coding in the cerebral cortex: The reaction of cells in monkey visual cortex to wavelengths and colours. Neuroscience, 9, 741–765. Zemel, R. S. (1994). A minimum description length framework for unsupervised learning. Unpublished doctoral dissertation, Department of Computer Science, University of Toronto.
Received January 4, 1996; accepted August 6, 1996.
NOTE
Communicated by Jonathan Baxter
Correction to “Lower Bounds on VC-Dimension of Smoothly Parameterized Function Classes”¹

Wee Sun Lee
Peter L. Bartlett
Department of Systems Engineering, RSISE, Australian National University, Canberra, ACT 0200, Australia
Robert C. Williamson Department of Engineering, Australian National University, Canberra, ACT 0200, Australia
The earlier article gives lower bounds on the VC-dimension of various smoothly parameterized function classes. The results were proved by showing a relationship between the uniqueness of decision boundaries and the VC-dimension of smoothly parameterized function classes. The proof is incorrect; there is no such relationship under the conditions stated in the article. For the case of neural networks with tanh activation functions, we give an alternative proof of a lower bound for the VC-dimension proportional to the number of parameters, which holds even when the magnitude of the parameters is restricted to be arbitrarily small.
Theorem 10 from the earlier article, which was used to prove lower bounds on various neural network classes, is incorrect:

Theorem 10. Let A be an open subset of $\mathbb{R}^m$, X be an open subset of $\mathbb{R}^n$, and f: A × X → ℝ be a continuously differentiable function (in all of its arguments). Let $F_\theta := \{\theta(f(a, \cdot)): a \in A\}$, where θ(y) = 1 for y > 0 and θ(y) = 0 otherwise. If there exists a k-dimensional manifold M ⊂ A such that for all $a_1, a_2 \in M$, $a_1 \neq a_2$ implies $\{x \in X: f(a_1, x) = 0\} \neq \{x \in X: f(a_2, x) = 0\}$, then VCdim($F_\theta$) ≥ k.

As counterexamples, observe that $y = (a - x)^2$ has unique decision boundaries but VC-dimension zero. Similarly, $y = (x - a)(x - b)^2$ has unique decision boundaries but VC-dimension 1.
¹ W. S. Lee, P. L. Bartlett, & R. C. Williamson, Neural Computation, 7(5), 1040–1053 (1995).
Neural Computation 9, 765–769 (1997) © 1997 Massachusetts Institute of Technology
Theorem 10 was used to prove a lower bound for the VC-dimension of neural networks with tanh activation functions. We can prove this lower bound more directly using Lemma 9 from the same article:

Lemma 9. Let A be an open subset of $\mathbb{R}^m$, X be an open subset of $\mathbb{R}^n$, and f: A × X → ℝ be a continuously differentiable function (in all of its arguments). Let $F_\theta := \{\theta(f(a, \cdot)): a \in A\}$, where θ(y) = 1 for y > 0 and θ(y) = 0 otherwise. Let $g: A \times X^m \mapsto \mathbb{R}^m$ be defined by $g(a, x^1, \ldots, x^m) = (f(a, x^1), \ldots, f(a, x^m))^T$. Let $g_x(a) = g(a, x)$. Let ϕ be any diffeomorphism from an open subset of A to a subset of $\mathbb{R}^m$, and let $\psi_x = g_x \circ \varphi^{-1}$. If VCdim($F_\theta$) < k, then for every a ∈ A and every $x \in X^m$, $g_x(a) = 0 \Rightarrow \mathrm{rank}(D\psi_x(b)) < k$, where b = ϕ(a).

Definition 1. A two-layer feedforward network with an n-dimensional input $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, k hidden units with activation function σ, and weights $w = (v_{10}, \ldots, v_{kn}, w_0, \ldots, w_k) \in \mathbb{R}^W$ (where W = kn + 2k + 1) is a function $f: \mathbb{R}^W \times \mathbb{R}^n \to \mathbb{R}$ given by

$$f(w, x) = w_0 + \sum_{i=1}^{k} w_i \, \sigma(v_i \cdot x + v_{i0}),$$

where $v_i = (v_{i1}, \ldots, v_{in})$ and $v_i \cdot x = \sum_{j=1}^{n} v_{ij} x_j$. The weights $w_0$ and $v_{i0}$, i = 1, ..., k, are called the offsets.
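For concreteness, the network of Definition 1 is only a few lines of code. The following is an illustrative numpy sketch (the function and variable names are ours, not from the article):

```python
import numpy as np

def two_layer_net(x, w0, w, V, v0, sigma=np.tanh):
    """f(w, x) = w0 + sum_{i=1}^k w_i * sigma(v_i . x + v_i0)  (Definition 1).

    x  : input vector, shape (n,)
    w0 : output offset (scalar)
    w  : second-layer weights w_1, ..., w_k, shape (k,)
    V  : first-layer weights v_ij, shape (k, n)
    v0 : first-layer offsets v_10, ..., v_k0, shape (k,)
    """
    return w0 + w @ sigma(V @ x + v0)

# Example with k = 3 hidden units and n = 2 inputs:
rng = np.random.default_rng(0)
k, n = 3, 2
y = two_layer_net(rng.normal(size=n), 0.1, rng.normal(size=k),
                  rng.normal(size=(k, n)), rng.normal(size=k))
```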
First we give some conditions on the activation function σ under which our result below holds. These conditions are similar to those used in Albertini, Sontag, and Maillot (1993) for proving the independence property. The function tanh satisfies these conditions.

Definition 2. The function σ is admissible if it satisfies the following conditions: it is a real-analytic function, and it extends to a function σ: ℂ → ℂ analytic on a subset D ⊂ ℂ of the form

$$D = \{z \in \mathbb{C}: |\mathrm{Im}\, z| \le \lambda\} \setminus \{z_0, \bar{z}_0\},$$

where Im $z_0$ = λ and ∞ > λ > 0. Here $\bar{z}_0$ is the complex conjugate of $z_0$, and $z_0$ and $\bar{z}_0$ are poles of σ.

We want to use Lemma 9 to find a lower bound for the VC-dimension of a two-layer feedforward network with admissible activation functions. First we show that the following matrix of derivatives with respect to first-layer weights is nonsingular for almost all sets of kn inputs $S = \{x^1, \ldots, x^{kn}\}$. (The
technique used is similar to that in Bartlett and Dasgupta (1995).) Define

$$D(S) = \begin{pmatrix}
\sigma'(v_1 \cdot x^1 + v_{10})\, w_1 x^1_1 & \sigma'(v_1 \cdot x^2 + v_{10})\, w_1 x^2_1 & \cdots & \sigma'(v_1 \cdot x^{kn} + v_{10})\, w_1 x^{kn}_1 \\
\sigma'(v_1 \cdot x^1 + v_{10})\, w_1 x^1_2 & & & \vdots \\
\vdots & & & \\
\sigma'(v_1 \cdot x^1 + v_{10})\, w_1 x^1_n & & & \\
\vdots & & & \\
\sigma'(v_k \cdot x^1 + v_{k0})\, w_k x^1_1 & & & \\
\vdots & & & \\
\sigma'(v_k \cdot x^1 + v_{k0})\, w_k x^1_n & \cdots & & \sigma'(v_k \cdot x^{kn} + v_{k0})\, w_k x^{kn}_n
\end{pmatrix} \qquad (1)$$

Let $\omega_{i,j} = v_i \cdot x^j + v_{i0}$. Define a diagonal matrix

$$Z(S) = \mathrm{diag}\big((\omega_{1,1} - p_{1,1})^m, \ldots, (\omega_{1,n} - p_{1,n})^m, (\omega_{2,n+1} - p_{2,n+1})^m, \ldots, (\omega_{k,kn} - p_{k,kn})^m\big),$$

where m is the multiplicity of the pole of σ′(·) at $p_{i,j}$ and $p_{i,j} \in \{z_0, \bar{z}_0\}$ for each element of Z(S). We will show that there is a value of S such that D(S)Z(S) is nonsingular, which implies that D(S) is nonsingular. Each component of D(S) and Z(S) is an analytic function of S on the set $D^{kn}$; hence the determinant is also an analytic function of S. This means that the determinant is nonzero for almost all values of S in $D^{kn}$, and hence also for almost all values of S when each member of S is drawn from ℝⁿ. Pick an S such that $v_i \cdot x^j + v_{i0} = p_{i,j}$ for each $p_{i,j}$ that appears in a component of Z(S), and such that for all other i, j, $v_i \cdot x^j + v_{i0} \in D$ and is not a pole of σ′(·). This can be done as long as no two hidden units are identical. (For each hidden unit i, we can choose its n corresponding $x^j$'s so that the activations are at the same pole and avoid the hyperplanes defining the poles of the other hidden units. If we choose the components of each $x^j$ small enough, the activations will remain in D.) All the singularities in D(S)Z(S) are removable, and the determinant can be made an analytic function of S. With the singularities removed, the matrix D(S)Z(S) assumes a block diagonal form, as the entries where $v_i \cdot x^j + v_{i0}$ is not a pole become zero. To complete the proof, we need each block in the block diagonal matrix to have full rank. For that, we need only choose the $x^j$'s within each block to be linearly independent while avoiding the finite number of hyperplanes that define the poles for the other hidden units. This is always possible since the input dimension is n and each block is of size n × n. We can use this result to prove the following modification of Theorem 13 in the earlier article. The bound obtained is (k − 1)(n − 1) instead of (k − 1)(n + 1) + 1.
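As an informal numerical illustration of this genericity argument (not part of the proof; the function and variable names are ours), one can draw random tanh-unit weights and kn random inputs and observe that the matrix of equation 1 is virtually always of full rank:

```python
import numpy as np

def derivative_matrix(V, v0, w, X):
    """Matrix of equation 1: rows indexed by first-layer weight (i, j),
    columns by the kn inputs x^1, ..., x^kn; the entry in row (i, j) and
    column l is sigma'(v_i . x^l + v_i0) * w_i * x^l_j, with sigma = tanh,
    so sigma'(z) = 1 - tanh(z)^2."""
    k, n = V.shape
    sp = 1.0 - np.tanh(X @ V.T + v0) ** 2   # (kn, k): sigma' at each activation
    D = np.empty((k * n, k * n))
    for i in range(k):
        for j in range(n):
            D[i * n + j, :] = sp[:, i] * w[i] * X[:, j]
    return D

rng = np.random.default_rng(1)
k, n = 3, 4
D = derivative_matrix(rng.normal(size=(k, n)), rng.normal(size=k),
                      rng.normal(size=k), rng.normal(size=(k * n, n)))
print(np.linalg.matrix_rank(D))   # prints 12 = k*n for almost every draw
```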
Figure 1: Network with weights deleted.
Theorem 3. Let $F_\theta$ be the class of two-layer feedforward networks (thresholded at zero at the output) with k hidden units with tanh activation, input space $X := \{(x_1, \ldots, x_n) \in \mathbb{R}^n: |x_i| < C\}$ (where C is a constant greater than zero), and k(n + 2) + 1 weights restricted to an open set that includes the origin. Then VCdim($F_\theta$) ≥ (k − 1)(n − 1).

Proof. Delete all the weights from the nth input and all the weights connected to the kth hidden unit, except the weight connecting the two of them, as shown in Figure 1. Let the smaller network (with weights deleted) be

$$f(w, x) = w_0 + \sum_{i=1}^{k-1} w_i \tanh(v'_i \cdot x' + v_{i0}) + w_k \tanh(v_{kn} x_n) = g(w', x') + w_k \tanh(v_{kn} x_n),$$

where $w' = (v_{10}, \ldots, v_{k-1,n-1}, w_0, \ldots, w_{k-1})$, $v'_i = (v_{i1}, \ldots, v_{i,n-1})$, and $x' = (x_1, \ldots, x_{n-1})$. We will also fix $w_k$ and $v_{kn}$ to be nonzero constants. Now g(w′, x′) is a network with n − 1 inputs and k − 1 hidden units (the network within the box in Fig. 1). We have shown that for almost all sets of (k − 1)(n − 1) real inputs, the rank of the matrix in equation 1 is at least (k − 1)(n − 1), as long as no two hidden units have the same set of weights. If we choose the outputs for g(w′, ·) in the range of $h: x_n \mapsto -w_k \tanh(v_{kn} x_n)$ (this can be
done since we can choose x′ as small as we like), the inputs to g(w′, ·) can be made to lie on the decision boundary by choosing $x_n$ appropriately. Hence inputs for which the matrix in Lemma 9 has rank (k − 1)(n − 1) exist: almost all sets of (n − 1)-dimensional inputs of size (k − 1)(n − 1) that map g(w′, ·) into the range of $h: x_n \mapsto -w_k \tanh(v_{kn} x_n)$, together with the corresponding $x_n$'s, have this property. Lemma 9 then gives the desired result.

Acknowledgments

We thank Adam Kowalczyk, Yossi Erlich, and Dan Chazan for pointing out the error.

References

Albertini, F., Sontag, E. D., & Maillot, V. (1993). Uniqueness of weights for neural networks. In R. Mammone (Ed.), Artificial Neural Networks for Speech and Vision (pp. 115–125). London: Chapman and Hall.

Bartlett, P. L., & Dasgupta, S. (1995). Exponential convergence of a gradient descent algorithm for a class of recurrent neural networks. In Proceedings of the 38th Midwest Symposium on Circuits and Systems.
Received January 15, 1996; accepted July 12, 1996.
NOTE
Communicated by Ronald Meir
Lower Bound on VC-Dimension by Local Shattering

Yossi Erlich∗
Engineering Department, Technion-Israel Institute of Technology, Haifa, Israel
Dan Chazan IBM Science and Technology Center, Haifa, Israel
Scott Petrack Vocaltec Ltd., Herzeliya, Israel
Avraham Levy Mathematics Department, Technion-Israel Institute of Technology, Haifa, Israel
We show that the VC-dimension of a smoothly parameterized function class is not less than the dimension of any manifold in the parameter space, as long as distinct parameter values induce distinct decision boundaries. A similar theorem was published recently and used to derive lower bounds on the VC-dimension for several cases (Lee, Bartlett, & Williamson, 1995). That theorem is not correct, but our theorem can replace it for those cases and many other practical ones.

1 Introduction

Many classification schemes use some kind of smoothly parameterized function class from which one classifier is chosen according to some learning scheme. The so-called learnability of a classifier class is often described by its VC-dimension, so lower and upper bounds on such VC-dimensions are of great interest. Recently a theorem for a lower bound on the VC-dimension of a smoothly parameterized function class was used to find lower bounds for some interesting cases (Theorem 10 of Lee et al., 1995). The theorem states that the VC-dimension of a parametric family of classifiers is bounded below by the dimension of any manifold in the classifier parameter space on which any two parameter values result in different decision boundaries. Koiran and Sontag (1996) showed that for the case of neural networks, the VC-dimension is lower-bounded by a quantity of the order of the square of the number of weights. However, Theorem 10 of Lee et al.
∗ Current address: T. J. Watson Center, IBM, Yorktown Heights, NY.
Neural Computation 9, 771–776 (1997) © 1997 Massachusetts Institute of Technology
(1995) is of great interest since it is more global and is not an asymptotic bound.

In order to obtain a corrected theorem, it was necessary to modify the definition of the allowable parameterizations of the decision sets. In Lee et al.'s article, any function h(a, x) defines a classification of x into the two classes $s_1 = \{x: h(a, x) > 0\}$ and $s_2 = \{x: h(a, x) < 0\}$. In addition, in the corrected theorem the function h(a, x) is assumed to have the form g(a, y) + z, where x = (y, z). Furthermore, the differentiability requirement with respect to the classified space can be reduced to a convexity assumption. On the other hand, differentiability with respect to the parameter space A is critical; we cannot demand continuity only, since a continuous map does not preserve dimension. For instance, it is possible to define a continuous invertible scalar function $h: \mathbb{R}^k \to \mathbb{R}$ that is used to classify x ∈ ℝ by the sign of x − h(a), where $a \in \mathbb{R}^k$ is the classifier parameter. In this case, the VC-dimension equals 1, regardless of the dimension k of the parameter space $\mathbb{R}^k$, although boundary uniqueness holds.

In additional work, Lee et al. (1997) propose a correction to their original article that bypasses the difficulty with their original theorem. They do this by a direct proof of special cases of the general theorem within the context of a specific application.

Definition 1 (Shattering, VC-dimension). Let F be a set of classifiers over a set X. A finite subset of X is said to be shattered by F if for every dichotomy on it, there exists a classifier in F that realizes it. The VC-dimension of F is the size of the maximal shattered set.

Following the idea of Lee et al. (1995), the theorem is proved by showing that there is a set of k points that is locally shattered. By local shattering we mean that there exists an arbitrarily small open set in the parameter space whose classifiers still shatter the set. The theorem for the case of a single classifier decision boundary is given in section 2. A more general theorem for the case of countably many decision boundaries is stated in Erlich (1996).

2 The Theorem

Theorem. Let A be an open subset of $\mathbb{R}^m$ (the parameter space). Let X be a subset of $\mathbb{R}^n$ (the feature space). If there exist a subset X̂ of X and a k-dimensional manifold M of A such that the following assumptions hold, then the VC-dimension is at least k:

• There exists a parametric representation Y × Z for X̂ such that $Y \subseteq \mathbb{R}^{n-1}$, Z ⊆ ℝ, and X̂ ⊆ Y × Z. (We will represent x ∈ X̂ by x = (y, z), where y ∈ Y, z ∈ Z.)
• X̂ is convex in z for every y ∈ Y: for all $(y, z_1), (y, z_2) \in \hat{X}$ and all 0 ≤ α ≤ 1, $(y, \alpha z_1 + (1 - \alpha) z_2) \in \hat{X}$.

• Let g: M × Y → ℝ be a continuously differentiable function in M that represents a boundary in X̂ in the following sense: for every a ∈ M, the classification of x = (y, z) ∈ X̂ is determined by the sign of z − g(a, y). For a zero argument, the value of the sign function can be chosen arbitrarily; this has no effect on the proof of the theorem.

• The boundaries are unique in M: for all $a_1 \neq a_2 \in M$, there exists y ∈ Y such that $(y, g(a_1, y)), (y, g(a_2, y)) \in \hat{X}$ and $g(a_1, y) \neq g(a_2, y)$.

Proof Outline. Following Lee et al. (1995), the gist of the proof is to show that there exist a parameter value a and k points $y_1, \ldots, y_k$ on the boundary, defined by $g(a, y_i) = z_i$, such that the gradients of g at $(a, y_i)$ with respect to a, $\{\nabla_a g(a, y_i)\}$, are linearly independent. All dichotomies can then be obtained by parameter values reached by infinitesimal variations of the parameter a (following the ideas of Lee et al., 1995).

The proof proceeds by induction. At each step of the induction there exist a point $a_l$ and l points $y_1, \ldots, y_l$ such that the gradients $\{\nabla_a g(a_l, y_i)\}$ span an l-dimensional space. It follows that within the submanifold $M_l$ defined by $g(a, y_i) = z_i$, i = 1, ..., l, there exists a neighborhood of $a_l$ in which the gradients $\{\nabla_a g(a, y_i)\}$ also span an l-dimensional space. It is then shown that a new point $y_{l+1}$ may be found, together with a point $a_{l+1}$ within this neighborhood on the manifold, for which the gradient $\nabla_a g(a_{l+1}, y_{l+1})$ is not perpendicular to the submanifold $M_l$. Because the submanifold is orthogonal to the gradients $\{\nabla_a g(a, y_i)\}$ by definition (since $g(a, y_i)$ is constant within this manifold), it follows that $\nabla_a g(a_{l+1}, y_{l+1})$ is not in the space spanned by $\{\nabla_a g(a_{l+1}, y_i)\}$. For this reason, $y_{l+1}$ may be adjoined to $y_1, \ldots, y_l$ to reestablish the induction hypothesis. In this way, a point $a_k$ is found at the end of the induction for which there exist $y_i$, i = 1, ..., k, such that the gradients $\{\nabla_a g(a_k, y_i)\}$ are all independent.

Proof. Choose two classifiers whose parameters $a_1$ and $a_2$ lie in the parametric k-dimensional manifold M. By the uniqueness condition, we can find $y_1 \in Y$ such that $(y_1, g(a_1, y_1)), (y_1, g(a_2, y_1)) \in \hat{X}$ and $g(a_1, y_1) \neq g(a_2, y_1)$. Without loss of generality, let $g(a_1, y_1) < g(a_2, y_1)$. We choose a smooth (continuously differentiable) curve in M between these two classifiers, h: ℝ → M, such that $h(0) = a_1$ and $h(1) = a_2$. Define the function f: ℝ → ℝ by $f(t) = g(h(t), y_1)$. f is continuously differentiable on [0, 1] and satisfies f(0) < f(1). Therefore, there exists $0 < t_0 < 1$ such that $f'(t_0) \neq 0$ and $f(0) < f(t_0) < f(1)$. Define $a_0 = h(t_0)$ and $z_1 = g(a_0, y_1)$. By the convexity of X̂ for constant y, we conclude that $(y_1, z_1) \in \hat{X}$. By the chain rule,

$$f'(t_0) = (\nabla_a g(a_0, y_1))^T h'(t_0), \qquad (2.1)$$
where $\nabla_a g(a, y_1)$ is the gradient of g with respect to a. Since $f'(t_0) \neq 0$, we conclude that $\nabla_a g(a_0, y_1)$ is not perpendicular to the k-dimensional manifold M. Therefore, there exists an open set $A_1 \subseteq A$ containing $a_0$ such that the set

$$M_1 = \{a \in A_1 \cap M \mid g(a, y_1) = z_1\} \qquad (2.2)$$

is diffeomorphic to a (k − 1)-dimensional ball (this follows directly from the implicit function theorem).

Let us summarize what we have done so far. We have found $(y_1, z_1) \in \hat{X}$ and an open set $A_1$ in the k-dimensional manifold M, in which $z_1 = g(a, y_1)$ defines a (k − 1)-dimensional manifold in M, named $M_1$.

We can continue in the same way and find $(y_2, z_2) \in \hat{X}$, a new point $a_0 \in M_1$, and an open set $A_2$ containing $a_0$ in $M_1$, in which $z_2 = g(a, y_2)$ defines a set $M_2$ that is diffeomorphic to a (k − 2)-dimensional ball. Specifically, $M_2$ is defined by

$$M_2 = \{a \in A_2 \cap M_1 \mid g(a, y_2) = z_2\}. \qquad (2.3)$$
We further limit the open set $A_2$ to one in which $\nabla_a g(a, y_2)$ is not spanned by $\nabla_a g(a, y_1)$. This is possible since the condition is satisfied at the new $a_0$ ($\nabla_a g(a_0, y_1)$ is orthogonal to $M_1$) and since $\nabla_a g(a, \cdot)$ is continuous. Continuing in this way, we get a set of k points

$$(y_i, z_i) \in \hat{X}, \quad i = 1, \ldots, k, \qquad (2.4)$$

and an open set $A_k \subseteq A$, such that the set

$$M_k = \{a \in A_k \cap M_{k-1} \mid g(a, y_k) = z_k\} \qquad (2.5)$$
is a zero-dimensional ball containing a single point (call this point â), where for every i = 2, ..., k, $\nabla_a g(\hat{a}, y_i)$ is not spanned by $\{\nabla_a g(\hat{a}, y_1), \ldots, \nabla_a g(\hat{a}, y_{i-1})\}$. Define B as the k × m matrix whose rows are the $\nabla_a g(\hat{a}, y_i)$. Therefore, B is of full rank. An almost identical proof to that of Lemma 9 in Lee et al. (1995) shows that the VC-dimension is at least k.

The classifier decision boundary is defined differently here than in the original theorem. The motivation for the changes is given by Erlich (1996). The theorem can easily be applied to many cases—for instance, the cases considered by Lee et al. (1995): two-layer neural networks with activation functions satisfying the property IP, of which tanh is one possibility (Theorem 17 of Lee et al., 1995), or radial basis function networks with gaussian basis functions (Theorem 19 of Lee et al., 1995). In
all those cases, boundary uniqueness is proved by writing the boundary in an explicit form, in which one coordinate of X (which can be taken as Z) is given as a function (on a k-dimensional manifold) of the others (which can be taken as Y). For all those cases, therefore, the theorem is easily applied. The theorem can also be used to find a lower bound on the VC-dimension of hidden Markov model (HMM)–based classifiers (Erlich, Merhav, & Chazan, 1996).

Yet there are cases where this theorem is not easily applied. For instance, if X = ℝ, the theorem can be applied only to cases where the boundary is no more than a single point, and consequently k is no more than 1. Another example is the classification of $x = (x_1, x_2) \in \mathbb{R}^2$ by the sign of $(x_1^2 + x_2^2 - a_1)(x_1^2 + x_2^2 - a_2)$, where $a = (a_1, a_2) \in \mathbb{R}^2$ is the parameter space. In this case, simple utilization of the theorem yields only 1 as a lower bound on the VC-dimension, whereas it is desirable to obtain easily the tight bound 2, which is attainable in the spirit of the theorem. These issues are resolved by a more general theorem concerning multiple decision boundaries (Erlich, 1996), in which a countable set of boundaries is considered. We do not give a detailed proof of this generalization, since it is a simple extension of the previous one. For many cases, this lower bound is not tight, typically (although not necessarily) when the shattering is not local. (For instance, it is easy to construct classes of classifiers, defined by a single parameter, whose VC-dimension is infinite.)¹

¹ For an example of such a feedforward neural net, see Sontag (1992).

Acknowledgments

We thank Wee Sun Lee, Peter Bartlett, and Bob Williamson for private correspondence concerning their work and Ronny Meir for his helpful comments.

References

Erlich, Y. (1996). On HMM based speech recognition using the MCE approach. M.Sc. thesis, Technion-Israel Institute of Technology. (Written in Hebrew; abstract in English.)

Erlich, Y., Merhav, N., & Chazan, D. (1996). On the MCE approach for training HMM based speech recognition systems. Unpublished manuscript.

Koiran, P., & Sontag, E. (1996). Neural networks with quadratic VC dimension. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (vol. 8, pp. 197–203). Cambridge, MA: MIT Press.

Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1995). Lower bounds on the VC dimension of smoothly parameterized function classes. Neural Computation, 7, 1040–1053.
Lee, W. S., Bartlett, P. L., & Williamson, R. C. (1997). Correction to "Lower bounds on the VC dimension of smoothly parameterized function classes." Neural Computation, 9, 765–769.

Sontag, E. (1992). Feedforward nets for interpolation and classification. Journal of Computer and System Sciences, 45, 20–48.
Received January 10, 1996; accepted September 5, 1996.
Communicated by Shimon Edelman
SEEMORE: Combining Color, Shape, and Texture Histogramming in a Neurally Inspired Approach to Visual Object Recognition

Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, California 90089 USA
Severe architectural and timing constraints within the primate visual system support the conjecture that the early phase of object recognition in the brain is based on a feedforward feature-extraction hierarchy. To assess the plausibility of this conjecture in an engineering context, a difficult three-dimensional object recognition domain was developed to challenge a pure feedforward, receptive-field-based recognition model called SEEMORE. SEEMORE is based on 102 viewpoint-invariant nonlinear filters that as a group are sensitive to contour, texture, and color cues. The visual domain consists of 100 real objects of many different types, including rigid (shovel), nonrigid (telephone cord), and statistical (maple leaf cluster) objects and photographs of complex scenes. Objects were individually presented in color video images under normal room lighting conditions. Based on 12 to 36 training views, SEEMORE was required to recognize unnormalized test views of objects that could vary in position, orientation in the image plane and in depth, and scale (factor of 2); for nonrigid objects, recognition was also tested under gross shape deformations. Correct classification performance on a test set consisting of 600 novel object views was 97 percent (chance was 1 percent) and was comparable for the subset of 15 nonrigid objects. Performance was also measured under a variety of image degradation conditions, including partial occlusion, limited clutter, color shift, and additive noise. Generalization behavior and classification errors illustrate the emergence of several striking natural shape categories that are not explicitly encoded in the dimensions of the feature space. It is concluded that in the light of the vast hardware resources available in the ventral stream of the primate visual system relative to those exercised here, the appealingly simple feature-space conjecture remains worthy of serious consideration as a neurobiological model.

Neural Computation 9, 777–804 (1997) © 1997 Massachusetts Institute of Technology

1 Introduction

The human visual system can recognize unprimed views of common objects at sustained rates in excess of 10 per second (Potter, 1976; Subramaniam,
Biederman, Kalocsai, & Madigan, 1995). In neurophysiological terms, the implication of this astonishing fact is that an object's identity can be computed by a 100-ms-wide time slice of neural activity propagating through the recognition "pipe" from the retina to the higher reaches of the visual system. At typical firing rates of visual neurons, we may surmise that under these special conditions, each neuron in the visual pathway devotes fewer than a dozen action potentials to the processing of any single object's image. While these figures bear primarily on the throughput of the recognition pipe (objects per second), a recent electroencephalogram study reveals that the length of the pipe from retina to recognition is probably no longer than 155 ms (DiGirolamo & Kanwisher, 1995). These results are also roughly consistent with known response latencies of neurons in the object recognition areas of inferotemporal cortex, which range around 100 ms (Oram & Perrett, 1992; Gochin, Colombo, Dorfman, Gerstein, & Gross, 1994). Beyond such feats of sheer speed, the human visual system easily recognizes a wide variety of stimuli, including rigid, articulated, and entirely nonrigid two-dimensional (2D) and three-dimensional (3D) objects, faces, statistical or fractal objects, surface textures, and views of complex natural scenes, while displaying remarkable insensitivity to changes in viewpoint, partial occlusion, and clutter.

How can a visual system work so well and so fast? The circumstances of biological vision suggest a solution of low time complexity (few steps) but high space complexity (many parallel processes)—synaptic connections among neurons in the cortical object recognition pathway probably number between 10 and 100 trillion.¹ One appealingly simple notion is that the visual system is organized as a feedforward feature-extracting hierarchy that builds progressively more complex and viewpoint-invariant features useful for identifying objects, where invariance over a group of transformations is achieved by summing over viewpoint-specific elemental representations (Pitts & McCulloch, 1947). This class of algorithm has been brought to bear with great success, for example, in the realm of optical character recognition (Fukushima, Miyake, & Ito, 1983; LeCun et al., 1990). According to this view, and in correspondence with the axioms of statistical pattern recognition, visual object recognition in the brain is the process of mapping retinal pixels into a feature space that is better suited (than pixels) to the viewpoint-invariant classification and identification problems faced by visual animals. Within this feature space, represented by the activity of neurons at the top of the hierarchy, the similarity of one object view to another is given by a simple (e.g., Euclidean) distance calculation; recognition of an input is achieved by finding the identity or class of the most similar training view previously
¹ This assumes 10¹⁴ to 10¹⁵ synapses in the brain, 90 percent of these due to the cerebral cortex, one-third of these due to the visual system, and one-third of these accounting for the "what" pathway.
stored in memory. Feature dimensions are chosen such that large changes in object pose produce relatively small excursions in feature space, while small changes in object "quality" (shape, texture, color) produce relatively large excursions in feature space. Where 3D objects or scenes are to be recognized over large regions of the viewing sphere, a small set of reference views is stored during learning, leading to a "view-based" approach (Edelman & Bülthoff, 1992; Logothetis, Pauls, Bülthoff, & Poggio, 1994; Murase & Nayar, 1995).

In evaluating the plausibility of this feedforward feature-extraction scenario as a model for biological vision, it is useful to consider several desirable properties of a feature-space representation as dictated by the "computational ecology" of natural vision. In particular, we might expect the brain to construct visual features that are:

1. Large in number, relating to the fact that sparse, high-dimensional feature representations provide fundamental advantages for fast, inexpensive recognition, especially large signal-to-noise ratios that allow objects to be recognized prior to explicit segmentation (Califano & Mohan, 1994; Kanerva, 1988).

2. Useful, that is, high-level features should be relatively sensitive to object quality—and hence identity—but relatively insensitive to an object's pose or configuration.

3. Dominated by spatially localized measures, relating to the need to cope with nonrigid object transformations, which often preserve local but not global structure; the need to cope with object textures, defined in large part by local relative-orientation structure; and the need to cope with occlusion and clutter, which are least disruptive to an object's internal code when derived from features with localized support.

4. Driven by multiple visual cues, relating to the need to maximize object discrimination power by using all available visual cues, the need to represent objects of many different types richly, and the need to buffer the visual representation of objects or scenes against a variety of forms of image degradation, to which different visual cues are by nature differentially sensitive.

In this context, recent neurophysiological results (Kobatake & Tanaka, 1994; Logothetis et al., 1994) that have extended classical results from other groups (Gross, Rocha-Miranda, & Bender, 1972; Perrett, Rolls, & Caan, 1982; Schwartz, Desimone, Albright, & Gross, 1983; Desimone, Albright, Gross, & Bruce, 1984) are intriguing, as they demonstrate a substantial population of neurons in the "object recognition areas" of the primate visual system (Mishkin, 1982) that respond best to specific complex minipatterns (e.g., localized conjunctions of contour, texture, and color elements). In many cases, these neurons exhibit considerable insensitivity to changes in viewpoint-
related parameters, such as stimulus position and scale (Ito, Tamura, Fujita, & Tanaka, 1995), while remaining selective for their preferred minipatterns. Such empirical data are thus on their face suggestive of a feature-extraction hierarchy designed to cope with known ecological pressures. However, serious theoretical concerns persist regarding the limitations of bottom-up feature-space approaches, cropping up under a variety of guises:

1. Feature-space methods are essentially template-matching methods and so require an intractable number of templates to cope with inputs that have been rotated, scaled, partly occluded, nonrigidly transformed, or presented under varying lighting conditions.

2. Feature-space approaches lack a top-down component essential for resolving featural ambiguities present in real images.

3. Feature-space methods do not scale well to high dimensions (i.e., large numbers of features), either because it is not practical to learn or otherwise assemble a sufficiently large number of sufficiently useful features, because too many dimensions of noise necessarily overwhelm too few dimensions of signal, or because high-dimensional methods are computationally intractable or require too much data, even for a brain.

Most of these concerns have been addressed in recent years, as feature-space approaches and their variants have been applied with success in a variety of object recognition domains (Fukushima et al., 1983; Swain & Ballard, 1991; Viola, 1993; Lades et al., 1993; Califano & Mohan, 1994; Murase & Nayar, 1995; Rao & Ballard, 1995; Amit, Geman, & Wilder, 1995; Schiele & Crowley, 1996). However, the conjecture that a feature-space approach based on feedforward receptive-field-style computations could account for the prodigious recognition capacities of the primate brain as yet lacks direct support in the modeling literature. Existing approaches have generally involved one or more assumptions that place them squarely outside the biological "paradigm" for general-purpose visual recognition, such as use of a small object corpus (typically no more than a few dozen objects), a limited range of object types (e.g., faces or rigid volumetric objects), feature computations not amenable to receptive-field-style implementation (e.g., use of sophisticated geometric invariants), or strong viewpoint assumptions or corresponding explicit image prenormalization operations (typical in optical character recognition and face recognition).

Against this backdrop, SEEMORE was developed to assess more directly the plausibility of the feature-space account as a biological model for rapid, general-purpose object recognition. Key design goals prescribed a visual representation that (1) relied exclusively on geometrically simple receptive-field-style computations, (2) operated directly on input images without shift, scale, or other explicit object prenormalization steps, (3) could cope with a large number of real 3D objects of many different types, (4) could recognize objects over 6 degrees of freedom of viewpoint, gross nonrigid shape dis-
tortions, and partial occlusion, and (5) exhibited a significant capacity for generalization across natural object categories. SEEMORE's visual representation is composed of 102 feature channels that emphasize spatially localized filter computations and are collectively sensitive to contour shape, color, and texture cues. SEEMORE's architecture is similar in spirit to the color histogramming approach of Swain and Ballard (1991) but includes spatially structured features that also provide for shape-based generalization. Experiments reveal good recognition performance in a viewpoint- and configuration-invariant 3D object recognition problem with 100 objects, including objects that are rigid, nonrigid, and "statistical" in nature, as well as photographs of complex scenes. Perhaps most interesting, SEEMORE's patterns of generalization reveal the emergence of several striking object categories that are not explicitly encoded in the 102 feature-space dimensions. (A short report describing this work appeared in Mel, 1996.)

2 Methods

2.1 SEEMORE's Visual World. SEEMORE's database contains 100 common 3D objects and photographs of scenes, each represented by a set of presegmented color video images (see Figure 1). The training set consisted of 12 to 36 views of each object, as follows. For rigid objects, 12 training views were chosen at roughly 60-degree intervals in depth around the viewing sphere (see Figure 2A), and each view was then scaled to yield a total of three images at 67 percent, 100 percent, and 150 percent. Image-plane orientation was allowed to vary arbitrarily. For nonrigid objects, 12 training views were chosen in random poses (see Figure 2B).

During a recognition trial, SEEMORE was required to identify novel test images of the database objects. For rigid objects, test images were drawn from the viewpoint interstices of the training set, excluding highly foreshortened views (e.g., the bottom of a can). Each test view could therefore be presumed to be correctly recognizable but was never closer than roughly 30 degrees in orientation in depth or 22 percent in scale to the nearest training view of the object, while position and orientation in the image plane could vary arbitrarily. For nonrigid objects, test images consisted of entirely novel random poses. Each test view depicted the object presented alone on a smooth background.

In some recognition trials, test views were systematically degraded. The five degradation conditions were (1) scrambled, in which the object view was cut up and the pieces rearranged and reflected (see Figure 3A), (2) occluded, in which 50 percent of the object view was blacked out (see Figure 3B), (3) cluttered, in which a second object—a randomly colored lowercase character—was superimposed onto the margin of the image to add limited pattern clutter while minimizing occlusion (see Figure 3C), (4) colorized, in which two of the three RGB channels were scaled either up or down by 30 per-
Figure 1: The database contains 100 objects of many different types, including rigid (soup can), nonrigid (necktie), and statistical (bunch of grapes) objects, and photographs of complex indoor and outdoor scenes.
cent while maintaining constant intensity (see Figure 3D), and (5) noisy, in which the images were subjected to uniform additive pixel noise with mean 30 percent of the maximum pixel value (see Figure 3E).

2.2 Feature Channels. SEEMORE's internal representation of a view of an object is encoded by a set of feature channels. The ith channel is based on an elemental nonlinear filter $f_i(x, y, \theta_1, \theta_2, \ldots)$, parameterized by position in the visual field and zero or more internal degrees of freedom (see Figure 4). Each channel is by design relatively sensitive to changes in the image that are strongly related to object identity, such as the object's shape, color, or
Figure 2: (A) Training views of rigid objects were sampled uniformly about the 3D viewing sphere. (B) Training views of nonrigid objects were drawn at random from the object’s configuration space.
texture, while remaining relatively insensitive to changes in the image that are unrelated to object identity, such as are caused by changes in the object's pose. In practice, this invariance is achieved in a straightforward way for each channel by subsampling and summing the output of the elemental channel filter over the entire visual field and one or more of its internal degrees of freedom, giving a channel output $F_i = \sum_{x, y, \theta_1, \ldots} f_i(\cdot)$. For example, a particular shape-sensitive channel might "look" for the image-plane projections of right-angle corners, over the entire visual field, 360 degrees of rotation in the image plane, 30 degrees of rotation in depth, one octave in scale, and tolerating partial occlusion or slight misorientation of the elemental contours that define the right angle. In general, then, $F_i$ may be viewed as a "cell" with a large receptive field whose firing rate is an estimate of the number of occurrences of distal feature i in the visual work space over a large range of viewing parameters.

SEEMORE's architecture consists of 102 feature channels, whose outputs form an input vector to a nearest-neighbor classifier (see Figure 5). Following the design of the individual channels, the channel vector $F = \{F_1, \ldots, F_{102}\}$ is (1) insensitive to changes in image-plane position and orientation of the object, (2) modestly sensitive to changes in object scale, orientation in depth, or nonrigid deformation, but (3) highly sensitive to object quality as it pertains to object identity. Within this representation, total memory storage for all views of an object ranged from 1224 to 3672 integers.

As shown in Figure 6, SEEMORE's channels fall into five groups: (1) 23 color channels, each of which responds to a small blob of color parameterized by "best" hue and saturation, (2) 11 coarse-scale intensity corner channels parameterized by open angle, (3) 12 "blob" features,
Figure 3: Sensitivity of recognition performance to various forms of image degradation was studied. Examples of the five degradation conditions are shown: (A) scrambled, (B) occluded, (C) cluttered, (D) colorized, and (E) noisy.
Figure 4: The ith feature channel is based on an elemental nonlinear filter $f_i(x, y, \theta_1, \ldots)$, which is subsampled across the image, at a range of image-plane orientations, and over any of the filter's additional internal degrees of freedom. The output of the ith channel, $F_i$, is the sum over all elemental filter outputs within that channel.
parameterized by the shape (round and elongated) and size (small, medium, and large) of bright and dark intensity blobs, (4) 24 contour-shape features, including straight angles, curve segments of varying radius, and parallel and oblique line combinations, and (5) 16 shape- and texture-related features based on the outputs of Gabor functions at five scales and eight orientations. The implementations of the channel groups were crude, in the interest of achieving a working, multiple-cue system with minimal development time. Images were grabbed using an off-the-shelf Sony S-Video camcorder and a SunVideo digitizing board that provided 11 color bits per pixel (YUV); images were
Figure 5: SEEMORE’s architecture consists of a set of channels that provides input to a nearest-neighbor classifier. Since each channel is by itself insensitive to translation and rotation in the image plane and is slowly varying as an object rotates in depth, the output vector F for the complete set of channels is similarly invariant to these transformations. However, F remains highly sensitive to changes in object quality, and hence object identity.
converted to RGB format for all subsequent processing. Implementation details of the 102 feature channels are described below.

2.2.1 Twenty-three Color Channels. The image was first low-pass-filtered by two octaves using a five-element separable mask (1/16, 1/4, 3/8, 1/4, 1/16) and subsampled to a size of 120 × 120 color pixels. In the following, for R, G, B ∈ [0, 1], hue, saturation, and intensity (HSI) are defined as

$$H = \arctan\!\left(\frac{\sqrt{3}\,(G - B)}{(R - G) + (R - B)}\right), \qquad S = 1.0 - \frac{\min(R, G, B)}{I}, \qquad I = \frac{R + G + B}{3},$$

and sigmoids g and h with threshold θ and gain s are defined as $g(x, \theta, s) = 1/(1 + \exp((\theta - x) \cdot s))$ and h = 1 − g. Of the 23 color channels, 22 (11 hues at 2 levels of saturation) were derived as follows. Around the hue circle, 1D "receptive field" centers were placed at 11 evenly spaced hues, including 0 degrees. Response profiles $hue_i$, i ∈ {1, ..., 11}, were triangular and symmetrical, returning a peak score of 1.0 in response to the center hue, with a linear decay to 0 over a 45-degree radius. The 11 high- and low-saturation unit pairs shared the same hue centers. The high-saturation unit outputs $y_i^{hisat}$ were computed as a product of the hue score and two sigmoidal factors that (1) fell to 0 as the saturation crossed into the range of the low-saturation units and (2) fell to 0 at low-intensity values where chromaticity
Figure 6: SEEMORE's 102 channels fall into five groups: (1) 23 circular hue/saturation channels, (2) 11 coarse-scale intensity corner channels, (3) 12 circular and oriented-intensity blobs, (4) 24 contour-shape features, including curves, junctions, and parallel and oblique contour pairs, and (5) 16 oriented-energy and relative-orientation features based on the outputs of Gabor functions at several scales and orientations.
information is unreliable, giving $y_i^{hisat} = hue_i \cdot g(S, 0.4, 10) \cdot g(I, 0.2, 20)$. The low-saturation unit outputs $y_i^{lowsat}$ were computed as a product of the hue score and three sigmoidal factors—the additional sigmoid was needed to bound the saturation value both from above and below—giving $y_i^{lowsat} = hue_i \cdot h(S, 0.4, 10) \cdot g(S, 0.15, 20) \cdot g(I, 0.2, 20)$. A separate white unit selected for pixels of high intensity and low saturation, with an output given by $y^{white} = g(I, 0.6, 20) \cdot h(S, 0.15, 20)$. Black pixels were ignored. Each of the 23 color functions was evaluated at every pixel in the image; any response value exceeding 0.1 generated a count within a histogram containing 23 bins, one for each "color."

2.2.2 Eleven Intensity Corner Channels. The image was first low-pass-filtered and subsampled by two octaves to a size of 120 × 120 intensity pixels. A crude oriented intensity edge detector was run in 12 increments of 30 degrees around the circle centered at each pixel; the length of the edge was approximately 20 pixels when mapped back into the full-resolution image (480 × 480). The criterion for detection of an oriented edge was twofold: (1) an intensity gradient of a consistent sign should exist across the edge at multiple sites along the length of the edge, and (2) the mean deviation of intensity (from its average) should be small along the length of the edge on both sides. Both criteria were computed from pairs of pixels symmetrically straddling the edge along its length, as shown in Figure 7. The numerical score for an edge defined in terms of left and right pixels from three pixel pairs indexed by i was given by

$$y = \prod_i g(|I_i^R - I_i^L|, 0.08, 30) - \lambda \sum_i \left(|I_i^R - \bar{I}^R| + |I_i^L - \bar{I}^L|\right),$$

where $\bar{I}^R$ and $\bar{I}^L$ denote the mean intensities along the right and left sides of the edge, and λ = 0.8 controlled the relative importance of the gradient versus smoothness constraints. The result was passed through a hard binary threshold giving 1 for y > 0 and 0 otherwise. The 12 oriented binary edge maps were used to find intensity corners, which were sought at every pixel. Corners were parameterized only by open angle (from 30 to 330 degrees in 30-degree increments), yielding 11 corner types and 11 corresponding histogram bins. A corner was scored whenever two edge segments of the proper relative orientation and offset were found in the image at any absolute orientation, and the corresponding histogram bin was incremented.

2.2.3 Twelve Intensity Blob Channels. The image was first low-pass-filtered and subsampled by two, three, and four octaves to yield image sizes from 120 × 120 down to 30 × 30 intensity pixels. The operations described below were carried out at each of the three scales. An intensity blob consisted of a set of intensity gradients of the proper sign in an elliptical configuration about a central pixel. Blobs were thus elliptical versions of the straight edge segments of Figure 7, based on several sigmoidally modulated pixel-pair intensity differences. However, in lieu of three gradient terms straddling a straight edge combined multiplicatively, a sum of 12 gradient terms
Figure 7: Straight-oriented edge segments were detected in the intensity image based on (1) sign consistency of cross-edge intensity differences, measured at several sample pairs of pixels, and (2) along-edge intensity "smoothness," measured as the mean deviation in intensity on either side of the putative edge.
across an elliptical contour was computed. A smoothness term was again given by the mean deviation of intensity along the inside and outside of the circular contour; this term was subtracted from the across-contour term with λ = 1.0. If a threshold of 4.0 was exceeded, a blob was scored, and the corresponding histogram bin was incremented. The 12 bins corresponded to light or dark, round or elongated intensity blobs, at three scales.

2.2.4 Forty Generalized Contour Channels. The image was low-pass-filtered as before but maintained at full resolution (480 × 480). An oriented edge detector was run that differed in three ways from the one underlying the intensity corner channels. First, edges consisted of five pairs of across-edge pixels spanning a total length of approximately 18 original image pixels at full resolution and were computed at 7.5-degree intervals around the half-circle. Second, an edge was scored iff all five gradient terms exceeded a hard threshold (in lieu of a product of sigmoids), and no along-contour smoothness penalty term was used. However, the gradient threshold differed from
edge to edge, given by an estimate of the along-edge gradient (the average difference of end pixel values on both sides of the edge). Third, an edge was scored if it existed according to the above criteria in any of the RGB channels, but neither the RGB channel nor the polarity of the edge was stored. As such, the output of the edge-detecting stage was a set of 24 binary contour maps representing orientations from 0 to 172.5 degrees, in 7.5-degree intervals, at a resolution of 240 × 240 pixels. Every edge in the binary edge maps was "generalized" (copied) into a 2 × 2 block of neighboring pixels, in order to effectively relax the position tolerances in the subsequent compound-feature detection step. Forty compound contour features were then tested for at every pixel (240 × 240) and orientation (48 steps of 7.5 degrees). A compound feature consisted of six oriented contours with proper offsets and relative orientations; when at least five were present, the compound feature was scored, and its histogram bin was incremented. Various optimizations were used to speed computation, including heuristics for early rejection of both pixels and compound features. As shown in Figure 6, the features included seven long, simple contours (straight plus 6 degrees of curvature), nine corner features with angles from 30 to 150 degrees in 15-degree steps (analogous to intensity corners but with longer limbs, higher angular resolution, and tolerance of up to one missing link), and parallel and oblique line pairs (12 each, at a range of separations from 16 to 82 pixels in increments of 6 pixels).

2.2.5 Sixteen Gabor-Derived "Texture" Channels. The largest central square in the original image was reduced to 128 × 128 pixels. Forty Gabor filters were applied to the image at eight orientations and five scales using a fast Fourier transform (FFT) (scales varied from X pixels per cycle to Y pixels per cycle in powers of √2). Energy images were computed by squaring and summing corresponding pixels in the sine and cosine components of the Gabor output. The energy pixel values were then summed across all eight orientation images at each scale, giving the total oriented energy at each scale as the values of the first five texture channels. The next five texture channels consisted of the variances of the oriented energy at each scale, obtained by computing the mean squared deviation of the total energy at each orientation from the mean total energy over all orientations, for that scale. High variances signified images with energy distributed nonuniformly across the orientation channels, such as an image with one or two dominant orientations. The Gabor energy images were then passed through a fixed binary thresholding operation, chosen for each scale to emphasize localization of perceptually relevant oriented image structure. The remaining six texture channels measured the probability of relative orientation energy in the image, parameterized by orientation difference (0 degrees, 45 degrees, 90 degrees) and image distance (d < 30 pixels, d ≥ 30 pixels). All relevant suprathreshold pixel pairs were considered and used to generate counts in the six histogram bins.
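To make the Gabor-energy construction concrete, the following is a minimal numpy/scipy sketch of the first ten texture channels (total oriented energy per scale, plus its variance across the eight orientations). The kernel size, envelope width, and wavelength values are our own illustrative assumptions, since the text leaves the exact scale range unspecified:

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_pair(size, wavelength, theta, sigma):
    """Cosine/sine Gabor kernels at orientation theta (radians)."""
    ax = np.arange(size) - size // 2
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    u = xx * np.cos(theta) + yy * np.sin(theta)
    env = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    phase = 2.0 * np.pi * u / wavelength
    return env * np.cos(phase), env * np.sin(phase)

def texture_channels(img, wavelengths, n_orient=8):
    """Total oriented energy per scale and its variance across orientations."""
    totals, variances = [], []
    for wl in wavelengths:
        energies = []
        for o in range(n_orient):
            ck, sk = gabor_pair(33, wl, np.pi * o / n_orient, sigma=wl / 2.0)
            c = fftconvolve(img, ck, mode="same")      # cosine component
            s = fftconvolve(img, sk, mode="same")      # sine component
            energies.append(np.sum(c ** 2 + s ** 2))   # sin^2 + cos^2 energy
        energies = np.asarray(energies)
        totals.append(energies.sum())                              # channels 1-5
        variances.append(np.mean((energies - energies.mean()) ** 2))  # channels 6-10
    return totals, variances

img = np.random.default_rng(2).random((128, 128))
tot, var = texture_channels(img, wavelengths=[4, 4 * 2 ** 0.5, 8, 8 * 2 ** 0.5, 16])
```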
2.3 The Learning Rule. While nearest-neighbor classification techniques are remarkably powerful given their simplicity (Friedman, 1994), the problem of scaling feature dimensions in order to minimize classification error in high dimensions remains an experimental art. The need for such optimization is particularly acute in cases, including the present one, where feature dimensions are grossly inhomogeneous in terms of their variances, entropy, and individual classification power, and where strong, poorly characterized correlations exist among the feature dimensions. The notion that features should simply be normalized by their variances (i.e., "sphering" the data) takes account neither of the individual utility of the features for the classification problem at hand nor of the correlations between features. One approach to this problem has been to project a high-dimensional feature space onto the low-dimensional subspace in which the exemplars of each class are as tightly clustered as possible while the class means are as widely dispersed as possible (e.g., Fisher's linear discriminant; Duda & Hart, 1973). Where dimensionality reduction is not needed for computational or other reasons, classification in all available dimensions in principle allows for improved classification performance. Thus, related methods involve finding a "clustering transformation" of the input space that maximizes between-class distances while minimizing within-class distances (Fukunaga, 1990).

In the approach taken here, an objective function that predicts classification performance using all dimensions is maximized by gradient ascent. The formulation is based on the standard assumption that, on average, two views of the same object are more similar to each other than two views of different objects. Thus, measurements are taken centered at every view j in the database, comparing the average distance from j to views of the same object with the average distance from j to views of different objects, where the comparison is normalized by a local measure of the dispersion of interclass distances. In this way, the mean distance to views of the same object can be treated as a z-score with respect to the distribution of distances to different objects (see Figure 8). The normalization is local to each view since the dispersion of object views can vary substantially in different regions of the feature space. Classification performance is expected to increase as these z-scores increase, averaged over the entire database of views.

We begin by writing the weighted city-block distance between two views represented as N-dimensional vectors j, k as

$$D_{jk} = D(j, k, w) = \sum_{i=1}^{N} w_i D^i_{jk},$$

where i is the feature index, $w_i$ is a weight, and $D^i_{jk} = |j_i - k_i|$ is the distance along the single feature dimension i. The mean distance between view j and all views k of the same object, or of different objects, is written, respectively, as

$$D_{\equiv j} = \langle D_{jk} \rangle, \ \forall k \equiv j$$
Figure 8: Graphical representation of $D_{\equiv j}$, $D_{\not\equiv j}$, and $\sigma_{\not\equiv j}$. Measured distances from j to members of the same object class (shown as solid lines) give rise to distribution 1, whose mean is $D_{\equiv j}$. Similarly, measured distances from j to members of all other classes give rise to distribution 2, whose (larger) mean is $D_{\not\equiv j}$. The predicted "goodness" of the feature space in the neighborhood of view j, $G_j$, is then given by the z-score $(D_{\not\equiv j} - D_{\equiv j}) / \sigma_{\not\equiv j}$.
or

$$D_{\not\equiv j} = \langle D_{jk} \rangle, \ \forall k \not\equiv j,$$

where the identity sign is used to indicate all views k of the same (≡ j) or different (≢ j) object as j. Analogous mean distance quantities are defined for the single feature dimension i, that is,

$$D^i_{\equiv j} = \langle D^i_{jk} \rangle, \ \forall k \equiv j \qquad \text{and} \qquad D^i_{\not\equiv j} = \langle D^i_{jk} \rangle, \ \forall k \not\equiv j.$$

The dispersion about view j of distances to views k of different objects is given by the standard deviation, for both the one-dimensional and N-dimensional cases:

$$\sigma^i_{\not\equiv j} = \sqrt{\big\langle (D^i_{jk} - D^i_{\not\equiv j})^2 \big\rangle}, \ \forall k \not\equiv j \qquad \text{or} \qquad \sigma_{\not\equiv j} = \sqrt{\big\langle (D_{jk} - D_{\not\equiv j})^2 \big\rangle}, \ \forall k \not\equiv j.$$
Both one-dimensional and N-dimensional feature "goodness" measures may then be defined in the neighborhood of view j as

$$G^i_j = \frac{D^i_{\not\equiv j} - D^i_{\equiv j}}{\sigma^i_{\not\equiv j}} \qquad \text{and} \qquad G_j = \frac{D_{\not\equiv j} - D_{\equiv j}}{\sigma_{\not\equiv j}}.$$
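A minimal numpy sketch of these per-view z-scores, under the assumption that the stored views are rows of a matrix F and every object has at least two stored views (all names are our own illustrative choices):

```python
import numpy as np

def goodness_per_view(F, labels, w):
    """z-score G_j for every stored view j.

    F      : (V, N) array, one channel vector per stored view
    labels : (V,) array, object identity of each view
    w      : (N,) array of channel weights
    """
    V = F.shape[0]
    idx = np.arange(V)
    G = np.empty(V)
    for j in range(V):
        d = np.abs(F - F[j]) @ w           # weighted city-block D_jk for all k
        same = labels == labels[j]
        d_same = d[same & (idx != j)]      # distances to views of the same object
        d_diff = d[~same]                  # distances to views of other objects
        G[j] = (d_diff.mean() - d_same.mean()) / d_diff.std()
    # The global measure of the next paragraph is simply G.mean().
    return G
```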
The global N-dimensional feature goodness measure is then given by $G = \langle G_j \rangle$. G is maximized when, averaged over all views j in the database, the average distance to views of the same object is many standard deviations smaller than the average distance to views of different objects. G was optimized by gradient ascent, using the following weight-update rule:

$$\Delta w_i \propto \frac{\delta G}{\delta w_i} = \left\langle \left[\, G^i_j - G_j \cdot \mathrm{Cov}_{\not\equiv j}(D^i, D) \,\right] \frac{\sigma^i_{\not\equiv j}}{\sigma_{\not\equiv j}} \right\rangle, \ \forall j, \qquad (2.1)$$

where

$$\mathrm{Cov}_{\not\equiv j}(D^i, D) = \frac{\big\langle (D^i_{jk} - D^i_{\not\equiv j})(D_{jk} - D_{\not\equiv j}) \big\rangle, \ \forall k \not\equiv j}{\sigma^i_{\not\equiv j} \, \sigma_{\not\equiv j}}$$
is the covariance of the one-dimensional distance (for feature i) and the weighted N-dimensional city-block distance from view j to views of different objects. Loosely speaking, the square-bracketed term in equation 2.1 indicates that a feature's weight tends to increase until its individual goodness accounts for a specific fraction of the global feature goodness, where the fraction is given by the covariance term.

2.4 Assessing Recognition Performance. SEEMORE's recognition performance was assessed quantitatively as follows. A test set consisting of 600 novel views (100 objects × 6 views) was culled from the database, as previously described. In some cases, the raw images were preprocessed according to one of the five degradation conditions (scrambled, occluded, colorized, noisy, or cluttered). Then the intact or degraded images were presented to SEEMORE for identification.

In the course of this work, it was noted empirically that a compressive transform on the feature dimensions (histogram values) led to improved classification performance; prior to all learning and recognition operations, therefore, each feature value was replaced by its natural logarithm (0 values
were first replaced with a small positive constant to prevent the logarithm from blowing up). For each test view j, the distance $D_{jt}$ was computed for every training view t in the database, and the nearest neighbor was chosen as the best match.² The logarithmic transform of the feature dimensions thus tied D to the ratios of individual feature values in two images rather than to their arithmetic differences.

3 Results

3.1 Intact Images. Recognition time on a Sparc-20 was 1–2 minutes per view; the bulk of the time was devoted to shape processing, with under 2 seconds required for matching. Recognition results are summarized in Table 1, reported as the proportion of test views that were correctly classified. Performance using all 102 channels for the 600 novel object views in the intact test set was 96.7 percent; the chance rate of correct classification was 1 percent. Across recognition conditions, second-best matches usually accounted for approximately half the errors. Results were broken down in terms of the separate contributions to recognition performance of color-related versus shape-related feature channels. Performance using only the 23 color-related channels was 87.3 percent, and using only the 79 shape-related channels it was 79.7 percent. Remarkably, very similar performance figures were obtained for the subset of 90 test views of the nonrigid objects, which included several scarves, a bike chain, a necklace, a belt, a sock, a necktie, a maple leaf cluster, a bunch of grapes, a knit bag, and a telephone cord. Thus, a novel random configuration of a telephone cord was as easily recognized as a novel view of a shovel.

3.2 Degraded Images. Results for the scrambled, occluded, colorized, noisy, and cluttered test sets are also shown in Table 1. Under the scrambling manipulation, overall recognition performance was only modestly affected, due primarily to the stability of the color channels under this manipulation, while the shape-related channels showed a significant drop in performance, as expected (from 80 to 62 percent). When test views were half-occluded, recognition performance fell to 79 percent, due to the disruptive effect of occlusion on both color- and shape-related channels; the
² In informal experimentation with a number of variants of the nearest-neighbor classifier, including a variety of distance metrics, typically only small differences in performance were seen, and the reasons for these differences were obscure. This work emphasized the power of the visual representation rather than the benchmarking of standard classifiers and their variants; given that nearest-neighbor classifiers are fast and conceptually simple, and often perform well in high-dimensional spaces in comparison to more sophisticated classifiers (Friedman, 1994), this single method was used throughout the trials reported in this article.
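A minimal sketch of the matching stage described above (compressive log transform followed by weighted city-block nearest-neighbor lookup); the flooring constant is an assumption, since the text specifies only "a small positive constant."

```python
import numpy as np

def preprocess(hist, floor=1e-3):
    """Compressive transform described in the text: replace zeros with a
    small positive constant, then take the natural logarithm."""
    return np.log(np.maximum(hist, floor))

def classify(test_hist, train_feats, train_labels, w):
    """Nearest-neighbor match under the weighted city-block distance.
    train_feats is a (T, N) array of already-preprocessed training views;
    w is the length-N vector of learned feature weights."""
    D = np.abs(train_feats - preprocess(test_hist)) @ w   # D_jt for all t
    return train_labels[np.argmin(D)]
```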
Table 1: Summary of Results. Entries are the percentage of test views correctly classified.

                  Intact  Nonrigid  Scrambled  Occluded  Cluttered  Colorized  Noisy
Shape only          79.7      76.7       62.2      38.2       57.3       43.5   35.8
Color only          87.3      94.4       86.5      72.2       61.2        6.8   47.2
Color and shape     96.7      97.8       93.7      79.0       79.0       19.8   58.3
the relatively severe disruption of the shape representation was in part due to the more spatially heterogeneous distribution of distinctive (i.e., necessary) shape cues and in part to the introduction of spurious contours and other accidental features at the occluding boundary. The color shift manipulation cut recognition performance to under 20 percent, an indication of the complete lack of color constancy in SEEMORE's feature channels. Note that while shape-related channels did not explicitly indicate the color of origin of their respective features, they depended on both color and intensity and were thus affected by the colorization manipulation as well. The clutter condition, which simulated the presence of a distractor object, dropped the correct recognition rate to just below 80 percent. The additive noise condition disrupted recognition performance along both shape and color axes, cutting overall performance to 58 percent, indicative of poor high-frequency noise tolerance in the underlying feature channels.

3.3 Generalization Behavior and Recognition Errors. Numerical indexes of recognition performance are useful but do not explicitly convey the similarity structure of the underlying feature space. A more qualitative but extremely informative representation of system performance lies in the sequence of images in order of increasing distance from a test view. Records of this kind are shown in Figure 9 for several recognition trials. In each, a test view is shown at the far left highlighted in red (gray), and the ordered set of nearest neighbors is shown to the right. When a test view's nearest neighbor (second image from left) was not the correct match, the trial was classified as a recognition error in Table 1. In Figure 9A, generalization behavior can be seen when only color channels were used for recognition, illustrating the limits of a simple "color histogramming" system. Errors of this kind occurred for objects with very similar color content and were caused by changes in lighting conditions that tipped the balance in favor of an incorrect object match. In Figure 9B, generalization behavior can be seen when only shape-related channels were used for recognition. Thus, as shown in row 1, a view of a book is judged most similar to a series of other books (or the bottom of a rectangular cardboard box), each a view of a rectangular object with high-frequency surface markings.
Figure 9: Generalization and errors. In each row, a novel test view is shown at the left (outlined in red). The sequence of best matching training views (one per object) is shown to the right, in order of decreasing similarity. (A) Examples of color generalization and errors (i.e., excluding shape-related features). (B) Examples of shape and texture generalization and errors, excluding color features. (C) Examples of generalization and errors using all 102 color- and shape-related channels.
A similar sequence can be seen in subsequent rows for (2) a series of cans, each a right cylinder with detailed surface markings, (3) a series of smooth, not-quite-round objects, (4) a series of photographs of complex scenes, and (5) a series of dinosaurs (followed by a teddy bear). In certain cases, SEEMORE's shape-related similarity metric was more difficult to interpret visually or verbalize (last two rows), or was substantially different from that of a human observer. When all 102 color- and shape-related channels were used for recognition, the few remaining errors occurred among views of objects that shared common shape and color features (see Figure 9C). These errors plainly illustrate the limitations of SEEMORE's visual discrimination power, exemplified by chronic confusion among the Coke can, rubber cement can, and a Campbell's soup can. The confusion among these three similarly proportioned right cylinders with red and white surface markings accounted for 3 of the 20 errors for the intact test set. An additional 3 errors were explained by confusion between two similar photographs.

4 Discussion

4.1 Recognition Performance and Generalization Behavior. As can be seen in Table 1, color alone is a remarkably powerful cue for object recognition, even in the total absence of shape information. This result is consistent with the results of Swain and Ballard (1991), who successfully recognized objects using color histograms alone and first drew the tentative analogy between histogram bins and neural receptive fields. However, when the color signatures of objects mimic each other and lead to recognition errors, the errors are generally "inexcusable" to a human observer (see Figure 9A). Thus, despite the utility of color for identification of specific instances of objects, the preeminence of shape information in object vision is clear, both for the definition of object categories (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976) and in the process of rapid object naming (Biederman & Ju, 1988).

In this context, several aspects of SEEMORE's shape representations deserve closer examination. First, the ecology of natural object vision gives rise to an apparent contradiction: generalization in shape space must in some cases permit an object whose global shape has been grossly perturbed to be matched to itself, such as the various tangled forms of a telephone cord, but quasi-rigid basic-level shape categories (e.g., chair, shoe, tree) must be preserved as well and distinguished from each other. A partial resolution to this conundrum lies in the observation that locally computed shape statistics are in large part preserved under the global shape deformations that nonrigid common objects (e.g., scarf, bike chain) typically undergo. A feature-space representation with an emphasis on locally derived shape channels will therefore exhibit a significant degree of invariance to global nonrigid shape deformations.
The principal sources of change to the local image-plane shape statistics under nonrigid shape deformations include (1) actual bending, rending, or warping of the object surfaces; (2) visual foreshortening as object surfaces change orientation in depth; (3) the addition and subtraction of object features that move in and out of self-occlusion; and (4) the addition and subtraction of spurious local shape statistics across accidental self-occluding boundaries. (Each of the latter three categories of feature changes applies to rigid objects as well.) The representational challenge, then, is to accumulate a sufficiently rich set of local shape statistics, which individually "tolerate" some degree of bending, warping, and foreshortening (categories 1 and 2), so that as a group they overwhelm in importance the feature-space excursions due to "capricious" feature changes (categories 3 and 4). The situation for rigid objects is somewhat less problematic, in that shape statistics are stable over longer spatial scales, and the problems associated with self-occlusion are generally less severe.

The definition of shape similarity embodied in the present approach is that two objects are similar if they contain similar profiles (histograms) of their shape measures, which emphasize locality. One way of understanding the emergence of global shape categories, then, such as "book," "can," and "dinosaur," is to view each as a set of instances of a single canonical object whose local shape statistics remain quasi-stable as it is warped into various global forms. In many cases, particularly within rigid object categories, exemplars may share longer-range shape statistics as well.

It is useful to consider one further aspect of SEEMORE's shape representation, pertaining to an apparent mismatch between the simplicity of the shape-related feature channels and the complexity of the shape categories that can emerge from them. Specifically, the order of binding of spatial relations within SEEMORE's shape channels is relatively low, consisting of single simple open or closed curves, or conjunctions of two oriented contours or Gabor patches. The fact that shape categories, such as "photographs of rooms" or "smooth, lumpy objects," cluster together in a feature space of such low binding order would therefore at first seem surprising. This phenomenon relates closely to the notion of "wickelfeatures" (Wickelgren, 1969; see also Rumelhart & McClelland, 1986, Chap. 18), where features that bind spatial information only locally, before being globally pooled, are nonetheless able to represent global patterns (words) with little or no residual ambiguity. For example, features sensitive to individual letters, but not their positions, are not sufficient to distinguish words such as ngtcnrooiie and recognition. In contrast, features sensitive to letter pairs, analogous to SEEMORE's shape features, are typically sufficient to eliminate representational ambiguity (i.e., multiple inverse images), even in the absence of relative positional information among the detected pairs. Thus, not only does no other English word contain the same set of adjacent letter pairs as recognition (co, ec, gn, io, it, ni, og, on, re, ti), but no other possible word contains the same letter pairs (conjecture without proof!). One mechanism underlying this phenomenon is that the overlapping subcomponents of the various features provide a form of implicit glue or binding that probabilistically constrains the spatial relations between features, even though these are not given explicitly within the representation. Thus the e in ec is likely to be the same e as in re, producing a high a posteriori probability for the compound rec.
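The letter-pair argument above is easy to check mechanically; a minimal sketch:

```python
def letter_pairs(word):
    """Set of adjacent letter pairs, ignoring order and position: the
    analogue of SEEMORE's low-order conjunctive shape features."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

# The two words have identical letter histograms...
assert sorted("ngtcnrooiie") == sorted("recognition")
# ...so position-blind single-letter features cannot tell them apart,
# but their adjacent letter-pair sets differ:
assert letter_pairs("recognition") == {"re", "ec", "co", "og", "gn",
                                       "ni", "it", "ti", "io", "on"}
assert letter_pairs("ngtcnrooiie") != letter_pairs("recognition")
```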
It is important to recall, however, that in direct opposition to the pressure for precise representation of shape is the pressure to generalize across different shapes within the same shape category, or between views of the same object in different configurations. One strategy for coping with this trade-off may be to maintain separate representations at a range of specificities, coupled with a source of expertise to choose between them.

4.2 Rationale for Choice of Feature Channels. SEEMORE's feature channels were designed following two main guidelines: as abstractions of known feature types seen in the ventral stream of the primate visual system, and to represent information in images that is known to be important for object perception. The level of abstraction was roughly as follows: color-sensitive cells, texture-sensitive cells, cells responding to contour conjunctions, cells distinguishing curved from straight, and so forth. Within these general design guidelines, the specific selectivities and spatial invariances that defined SEEMORE's feature channels were largely shaped by two additional forces: (1) the contents of the object database, which implicitly rendered certain image cues more salient than others for purposes of classification, and (2) the relatively severe, monolithic nature of the recognition task, which entailed nearly complete viewpoint uncertainty from trial to trial. Both of these influences make it likely that the details of SEEMORE's feature channels differ from those of any real animal, whose image statistics inevitably differ from those contained in SEEMORE's database, and for whom very different viewpoint assumptions, requiring different degrees of invariance, hold in different behavioral contexts. As such, SEEMORE's "neurophysiology" (feature channel responses) and "psychophysics" (recognition performance and patterns of generalization) do not yet provide a detailed or comprehensive model of the neurophysiological or psychophysical record for any particular animal species. Indeed, the fitting of existing empirical data along these lines has not been a goal of this work so far. On the other hand, SEEMORE's representations fall easily within the family of receptive-field-based representations seen in biological visual systems and, under many of the same challenges faced by "real" visual systems, have exhibited levels of recognition performance and patterns of generalization that seem surprising given the simplicity of the underlying computations. Given that the scale of visual hardware in the ventral stream of the primate visual system devoted to object recognition likely outstrips SEEMORE's 102 feature channels by multiple orders of magnitude, it seems a reasonable possibility that the gap between the performance of a visual "simpleton" like SEEMORE and his much more gifted cousins in the animal kingdom is largely a matter of scale.
In contrast to the engineering-driven approach adopted here, features could in principle have been learned. Unsupervised learning approaches have been used to discover statistical structure in natural images that is strikingly close to the structure of simple-cell receptive fields in the mammalian visual system (Olshausen & Field, 1996). In the context of recognition, however, the problem of feature learning is more difficult, in that statistical structure present in an ensemble of images may or may not be useful. The utility of image features depends on the specific classification problem that the animal faces, which can change from moment to moment; consider the processing of faces for identity versus age versus gender versus emotional state, and so forth. The discovery of useful structure is thus a problem requiring task-dependent supervision and potentially large quantities of labeled data. If the features needed for recognition are highly descriptive, as needed in difficult recognition domains, then the search spaces for either unsupervised or supervised approaches to feature learning are huge. Strong representational biases could perhaps make such approaches more tractable.
4.3 Limitations and Extensions. The presegmentation of objects in the test sets used here is a simplifying assumption that is clearly invalid in the real world. The advantage of the assumption from a methodological perspective is that the object similarity structure induced by the feature dimensions can be studied independent of the problem of segmenting or indexing objects embedded in complex scenes. Thus, a representation that does not permit good discrimination among a large number of objects over a range of 3D viewpoints and internal configurations, or does not impose appropriate category structure among objects, need not be considered further as a component of a system that deals with objects embedded within scenes.

The cluttered test condition (see Figure 3C) was designed to assay explicitly the sensitivity of SEEMORE's visual representations to additive pattern noise. While the figures in Table 1 show a seven-fold increase in error rate for cluttered test images relative to the intact condition (21 percent versus 3 percent), these figures must be interpreted with care. First, as for the case of intact images, approximately half (44 percent) of the "errors" in the cluttered case were second-best matches from a database of 100 objects and so do not represent total recognition failure. Moreover, the act of cluttering (or occluding or color shifting) a test image can introduce legitimate accidental similarities between the modified test view and a view of another (i.e., "incorrect") object. Since SEEMORE is entirely unsuspecting of these manipulations and is asked simply to compare incoming views to familiar views at face value, some of the committed errors may be excusable, to the
extent that the degraded object view does look most similar to a view of an incorrect object to the eye of a human observer. When pattern clutter is made more severe than that illustrated in Figure 3C, however, recognition performance suffers dramatically. One of the main reasons for this is that SEEMORE's feature channels are currently far from sparse in their activity profiles across images; most feature channels are activated to some degree by most object views. Under these circumstances, the representations of multiple objects in the visual field additively collide within the same feature channels, making separate recognition difficult or impossible.

Two kinds of remedies for this problem have been suggested by biology and prior work. The first is crude image segmentation based on a movable fovea or other focal attention system (Koch & Ullman, 1985; Niebur & Koch, 1994). This type of strategy typically minimizes, but cannot eliminate, the additive visual noise induced by extraneous objects outside the focus of attention. The second, more potent remedy is the leap to a sparse, very-high-dimensional space, whose mathematical advantages for vision and recognition have been discussed at length elsewhere (Kanerva, 1988; Califano & Mohan, 1994). For example, in a domain of 2D line drawings, Califano and Mohan (1994) have demonstrated position-, orientation-, and scale-invariant recognition in multiple-object scenes with no prior segmentation, using on the order of 10^6 highly descriptive contour-based features (based on conjunctions of triplets of contour segments). Intuitively, an object is recognized when a sufficient number of sufficiently distinctive features "vote" for the presence of a particular object, regardless of whether other objects also get votes. The leap into high-dimensional feature space is easily arranged combinatorially, by conjoining existing (or other) low-order features into compound features of higher order.

The issue of feature locality remains crucial, however. If the desired combinatorial explosion is achieved by conjoining elemental shape tokens over relatively long distances in the image, the resulting shape-based similarity metric can permit extremely fine shape distinctions but may lead to poor generalization under nonrigid shape deformation and partial occlusion. This approach is most relevant in feature-poor images, in which the variety of local token configurations is limited. In feature-dense images, such as natural images, the combinatorial explosion can be achieved while maintaining locality within elemental feature configurations, thus preserving the constellation of desirable representational properties that locality confers. One means of effectively increasing the richness of localized feature configurations is to include elemental filters simultaneously sensitive to multiple visual cues, such as contour, color, shading, and texture cues. Interestingly, the preferred responses of pattern-selective cells in the primate IT cortex frequently involve explicit binding of contour shapes with particular colors and/or textures (Kobatake & Tanaka, 1994). In this context, we may interpret these data as evidence of an implicit neural strategy whose goal is to achieve sparse high dimensionality in order to simplify object indexing in cluttered scenes, while retaining locality, in order to facilitate viewpoint-, configuration-, and occlusion-insensitive shape-based generalization. In continuing work, we are pursuing this course.
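The following sketch illustrates the voting scheme described above in its simplest form. The conjunction order, the vote threshold, and the representation of tokens and the model index are illustrative assumptions, not details of Califano and Mohan's system.

```python
from collections import Counter
from itertools import combinations

def conjunctive_features(tokens, k=3):
    """Form high-order features by conjoining k elemental tokens.  Each
    token is any hashable local descriptor (e.g., a quantized contour
    fragment); the combinatorial explosion into a sparse, very-high-
    dimensional space is what makes individual conjunctions distinctive.
    Note that conjoining over the whole token set ignores the locality
    constraint discussed in the text."""
    return {frozenset(c) for c in combinations(tokens, k)}

def recognize(scene_tokens, model_index, min_votes=10):
    """Each distinctive conjunction votes for the models that contain it;
    objects are reported independently, so multiple objects in an
    unsegmented scene can win simultaneously."""
    votes = Counter()
    for f in conjunctive_features(scene_tokens):
        for obj in model_index.get(f, ()):
            votes[obj] += 1
    return [obj for obj, v in votes.items() if v >= min_votes]
```

A model index for this sketch would be built by applying conjunctive_features to the tokens of each training view and recording, for every resulting feature, the object it came from.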
Acknowledgments

Thanks to József Fiser for useful discussions and for development of the Gabor-based channel set, to Dan Lipofsky and Scott Dewinter for help in constructing the image database, and to Christof Koch for providing support at Caltech, where this work was initiated. This work was funded by the Office of Naval Research and the McDonnell-Pew Foundation.

References

Amit, Y., Geman, D., & Wilder, K. (1995). Recognizing shapes from simple queries about geometry (Tech. Rep.). Chicago: University of Chicago, Department of Statistics.

Biederman, I., & Ju, G. (1988). Surface vs. edge-based determinants of visual recognition. Cognitive Psychology, 20, 38–64.

Califano, A., & Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16, 373–392.

Desimone, R., Albright, T. D., Gross, C., & Bruce, C. (1984). Stimulus-selective properties of inferior temporal neurons in the macaque. J. Neurosci., 4, 2051–2062.

DiGirolamo, G., & Kanwisher, N. (1995). Accessing stored representations begins within 155 ms in object recognition. Paper presented at the 36th Annual Meeting of the Psychonomics Society, Los Angeles.

Duda, R. O., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.

Edelman, S., & Bülthoff, H. (1992). Orientation dependence in the recognition of familiar and novel views of three-dimensional objects. Vision Res., 32, 2385–2400.

Friedman, J. H. (1994). Flexible metric nearest neighbor classification (Tech. Rep.). Stanford University, Department of Statistics and Stanford Linear Accelerator Center.

Fukunaga, K. (1990). Introduction to statistical pattern recognition. New York: Academic Press.

Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man and Cybernetics, SMC-13, 826–834.

Gochin, P. M., Colombo, M., Dorfman, G. A., Gerstein, G., & Gross, C. (1994). Neural ensemble coding in inferior temporal cortex. J. Neurophysiol., 71, 2325–2337.

Gross, C., Rocha-Miranda, C., & Bender, D. (1972). Visual properties of neurons in inferotemporal cortex of the macaque. J. Neurophysiol., 35, 96–111.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal responses in monkey inferotemporal cortex. J. Neurophysiol., 73, 218–226.

Kanerva, P. (1988). Sparse distributed memory. Cambridge, MA: MIT Press.

Kobatake, E., & Tanaka, K. (1994). Neuronal selectivities to complex object features in the ventral visual pathway of the macaque cerebral cortex. J. Neurophysiol., 71, 856–867.

Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiol., 4, 219–227.

Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.

Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., & Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. Los Alamitos, CA: IEEE Computer Society Press.

Logothetis, N., Pauls, J., Bülthoff, H., & Poggio, T. (1994). View-dependent object recognition by monkeys. Current Biology, 4, 401–414.

Mel, B. (1996). SEEMORE: A view-based approach to 3-D object recognition using multiple visual cues. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8, pp. 865–871). Cambridge, MA: MIT Press.

Mishkin, M. (1982). A memory system in monkey. Philos. Trans. R. Soc. Lond. B., 298, 85–95.

Murase, H., & Nayar, S. (1995). Visual learning and recognition of 3-D objects from appearance. International Journal of Computer Vision, 14, 5–24.

Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience, 1(1), 141–158.

Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Oram, M., & Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68(1), 70–84.

Perrett, D., Rolls, E., & Caan, W. (1982). Visual neurons responsive to faces in the monkey temporal cortex. Exp. Brain Res., 47, 329–342.

Pitts, W., & McCulloch, W. (1947). How we know universals: The perception of auditory and visual forms. Bull. Math. Biophys., 9, 127–147.

Potter, M. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Human Learning and Memory, 2, 509–522.

Rao, R., & Ballard, D. (1995). An active vision architecture based on iconic representations. Artificial Intelligence, 78, 461–505.

Rosch, E., Mervis, C., Gray, W., Johnson, D., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382–439.

Rumelhart, D., & McClelland, J. (1986). Parallel distributed processing. Cambridge, MA: MIT Press.
Schiele, B., & Crowley, J. (1996). Probabilistic object recognition using multidimensional receptive field histograms. In Proc. of the 13th Int. Conf. on Patt. Rec. (Vol. 2, pp. 50–54). Los Alamitos, CA: IEEE Computer Society Press.

Schwartz, E., Desimone, R., Albright, T., & Gross, C. (1983). Shape recognition and inferior temporal neurons. Proc. Natl. Acad. Sci. USA, 80, 5776–5778.

Subramaniam, S., Biederman, I., Kalocsai, P., & Madigan, S. R. (1995). Accurate identification, but chance forced-choice recognition for RSVP pictures. Poster presented at the Annual Meeting of the Association for Research in Vision and Ophthalmology, Ft. Lauderdale, FL.

Swain, M. J., & Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7, 11–32.

Viola, P. (1993). Feature-based recognition of objects. In Proc. of the AAAI Fall Symposium on Learning and Computer Vision (pp. 60–65).

Wickelgren, W. A. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev., 76, 1–15.
Received April 29, 1996; accepted September 19, 1996.
Communicated by Peter König
Image Segmentation Based on Oscillatory Correlation

DeLiang Wang
Department of Computer and Information Science and Center for Cognitive Science, The Ohio State University, Columbus, Ohio 43210, USA
David Terman
Department of Mathematics, The Ohio State University, Columbus, Ohio 43210, USA
We study image segmentation on the basis of locally excitatory, globally inhibitory oscillator networks (LEGION), whereby the phases of oscillators encode the binding of pixels. We introduce a lateral potential for each oscillator so that only oscillators with strong connections from their neighborhood can develop high potentials. Based on the concept of the lateral potential, a solution for removing noisy regions in an image is proposed for LEGION, so that the network suppresses the oscillators corresponding to noisy regions without affecting those corresponding to major regions. We show that the resulting oscillator network separates an image into several major regions, plus a background consisting of all the noisy regions, and we illustrate the network's properties by computer simulation. The network exhibits a natural capacity for segmenting images. The oscillatory dynamics leads to a computer algorithm, which is applied successfully to segmenting real gray-level images. A number of issues regarding biological plausibility and perceptual organization are discussed. We argue that LEGION provides a novel and effective framework for image segmentation and figure-ground segregation.

1 Introduction

The segmentation of a visual scene (image) into a set of coherent patterns (objects) is a fundamental aspect of perception that underlies a variety of tasks, such as image processing, figure-ground segregation, and automatic target recognition. Scene segmentation plays a critical role in the understanding of natural scenes. Although humans perform it with apparent ease, the general problem of image segmentation remains unsolved in sensory information processing. As the technology of single-object recognition has become more and more advanced, the demand for a solution to image segmentation has increased, since both natural scenes and manufacturing applications of computer vision are rarely composed of a single object.

Objects appear in a natural scene as the grouping of similar sensory features and the segregation of dissimilar ones. Sensory features are generally
taken to be local, and in the simplest case they may correspond to single pixels. To approach the problem of scene segmentation, three basic issues must be addressed: What are the cues that determine grouping and segregation? What is the proper representation for the result of segmentation? How are the cues used to give rise to segmentation?

Much is known about the sensory cues that are important for segmentation. In particular, Gestalt psychology has uncovered a set of principles guiding the grouping process in the visual domain (Wertheimer, 1923; Koffka, 1935; Rock & Palmer, 1990). These principles work together to produce segmentation. We briefly summarize some of the most important principles (see also Rock and Palmer, 1990):

• Proximity. The closer features lie to each other, the more easily they are grouped into the same segment.

• Similarity. Features that have similar attributes, such as grayness, color, depth, or texture, tend to group together.

• Common fate. Features that have similar temporal behavior tend to group together. For instance, a group of features that move coherently (common motion) would form a single object. Notice that common fate may be regarded as one aspect of similarity. We list it separately to emphasize the importance of time as a separate dimension.

• Connectedness. A uniform, connected region, such as a spot, line, or more extended area, tends to form a single segment.

• Good continuation. A set of features that form a smooth and continuous curve tend to group together.

• Prior knowledge. If a set of features belong to the same familiar pattern, they tend to group together.

In computer vision algorithms for image segmentation, the result of segmentation can be represented in many ways. However, it is not a trivial task to represent the outcome of segmentation in a neural network. One proposal is naturally derived from the so-called neuron doctrine (Barlow, 1972), where neurons at higher brain areas are assumed to become more selective, and eventually a single neuron represents each single object (the grandmother-cell representation). Multiple objects in a visual scene would be represented by the coactivation of multiple units at some level of the nervous system. This representation faces major theoretical and neurobiological problems (von der Malsburg, 1981; Abeles, 1991; Singer, 1993). Another proposal relies on temporal correlation to encode the binding (Milner, 1974; von der Malsburg, 1981; Abeles, 1982). In particular, the correlation theory of von der Malsburg (1981) asserts that an object is represented by the temporal correlation of the firing activities of the scattered cells that encode different features of the object. Multiple objects are represented by different correlated firing patterns that alternate in time, each corresponding to a single object.
Temporal correlation provides an elegant way to represent the result of segmentation. A special form of temporal correlation is oscillatory correlation, where the basic unit is a neural oscillator (see Terman & Wang, 1995; Wang & Terman, 1995a). However, this representation does not by itself reveal how segmentation is achieved using Gestalt grouping principles. Despite an extensive body of literature dealing with segmentation using temporal correlation (starting perhaps from von der Malsburg & Schneider, 1986), little progress has been made in building successful neural systems for image segmentation.

There are two major challenges facing the oscillatory correlation theory. The first challenge is how to achieve fast synchronization within a population of locally coupled oscillators. Most of the models proposed for achieving phase synchrony rely on all-to-all connections (see section 2 for more details). However, as Sporns, Tononi, and Edelman (1991) and Wang (1993a) pointed out, a network with full connections indiscriminately connects all the oscillators activated simultaneously by different objects, because the network is dimensionless and loses critical information about geometry. The second challenge is how to achieve fast desynchronization among different groups of oscillators representing distinct objects. This is necessary in order to segment multiple objects that are simultaneously presented.

We have previously proposed a neural network framework to deal with the problem of image segmentation, called locally excitatory, globally inhibitory oscillator networks (LEGION) (Wang & Terman, 1995a; Terman & Wang, 1995). Each oscillator is modeled as a standard relaxation oscillator. Local excitation is implemented by positive coupling between neighboring oscillators, and global inhibition is realized by a global inhibitor. LEGION exhibits the mechanism of selective gating, whereby oscillators stimulated by the same pattern tend to synchronize due to local excitation, and oscillator groups stimulated by different patterns tend to desynchronize due to global inhibition (Wang & Terman, 1995a; Terman & Wang, 1995). We have proven that, with the selective gating mechanism, LEGION rapidly achieves both synchronization within groups of oscillators that are stimulated by connected regions and desynchronization between different groups. In sum, LEGION provides an elegant solution to both challenges outlined above.

In this article, we study LEGION for segmenting real images. Before we demonstrate image segmentation, the original version of LEGION needs to be extended to handle images with many tiny (noisy) regions. One such example is shown in Figure 1, where three objects and a noisy background form a visual image. Without extension, LEGION would treat each region, no matter how small, as a separate segment, and would thus produce many tiny fragments. We call this problem fragmentation. A more serious problem is that it is difficult to choose parameters so that LEGION is able to achieve more than several (5 to 10) segments (Terman & Wang, 1995). Noisy fragments may therefore compete with major image regions for becoming segments, so that it may not be possible to extract the major segments from an image.
Figure 1: A caricature of an image with three objects that appear on a noisy background. The noiseless caricature is adapted from Terman and Wang (1995).
The problem of fragmentation is solved by introducing the concept of a lateral potential for each oscillator. The extended dynamics is fully analyzed (Wang & Terman, 1996), and the resulting LEGION network is applied to gray-level images and yields successful segmentation. A preliminary version of this work was presented in Wang and Terman (1995b).

In the next section, we review prior work relevant to image segmentation and neural networks. In section 3, our model is described in detail. In section 4, computer simulations of the extended LEGION network are presented. Section 5 presents the segmentation results on real images. Further discussion concerning our approach is given in section 6.

2 Related Work

2.1 Image Segmentation Algorithms. Due to its critical importance for computer vision, image segmentation has been studied extensively. Many techniques have been invented (for reviews of the subject, see Zucker, 1976;
Haralick, 1979; Haralick & Shapiro, 1985; Sarkar & Boyer, 1993b). There are three broad categories of algorithms: pixel classification, edge-based organization, and region-based segmentation. A simple classification technique is thresholding: a pixel is assigned a specific label if some measure of the pixel passes a certain threshold. This idea can be extended to more complex forms, including multiple thresholds, which are determined by pixel histograms (Kohler, 1981). Edge-based techniques generally start with an edge-detection algorithm, which is followed by grouping edge elements into rectilinear or curvilinear lines. These lines are then grouped into boundaries that can be used to segment images into various regions (see, for example, Geman, Geman, Graffigne, & Dong, 1990; Sarkar & Boyer, 1993a; Foresti, Murino, Regazzoni, & Vernazza, 1994). Finally, region-based techniques operate directly on regions. A classical method is region growing and splitting (or split-and-merge; see Horowitz & Pavlidis, 1976; Zucker, 1976; Adams & Bischof, 1994), where iterative steps are taken to grow (or split) pixels into a connected region if all the pixels in the region satisfy some conditions. One of the apparent deficits of these algorithms is their iterative (serial) nature (Liou, Chiu, & Jain, 1991). There are some recent algorithms that are partially parallel (Liou, Chiu, & Jain, 1991; Mohan & Nevatia, 1992; Manjunath & Chellappa, 1993). Most of these techniques rely on domain-specific heuristics to perform segmentation, and no unified computational framework exists to explain the general phenomenon of scene segmentation (Haralick & Shapiro, 1985). The problem of scene segmentation is computationally hard (Gurari & Wechsler, 1982) and largely regarded as unsolved.

2.2 Neural Network Efforts. Neural networks have proven to be a successful approach to pattern recognition (Schalkoff, 1992; Wang, 1993b). Unfortunately, little work has been devoted to scene segmentation, which is generally regarded as part of preprocessing (often meaning manual segmentation). Scene segmentation is a particularly challenging task for neural networks, partly because traditional neural networks lack the representational power for encoding multiple objects simultaneously. Interesting schemes of segmentation based on learning have been proposed (Sejnowski & Hinton, 1987; Mozer, Zemel, Behrmann, & Williams, 1992). Grossberg and Wyse (1991) proposed a model for segmentation based on the contour-detection model of Grossberg and Mingolla (1985; see Gove, Grossberg, & Mingolla, in press, for a computer simulation). However, all of these methods were tested only on small synthetic images, and it is not clear how they can be extended to handle real images. Also, Kohonen's self-organizing maps have been used for segmentation based on pixel classification (Kohonen, 1995; Koh, Suk, & Bhandarkar, 1995). A primary drawback of these methods is that the number of segments (objects) is assumed to be known a priori.

Because temporal (oscillatory) correlation offers an elegant way of representing multiple objects in neural networks (von der Malsburg & Schneider, 1986),
most of the neural network efforts on image segmentation have centered around this theme. In particular, the discovery of synchronous oscillations in the visual cortex has triggered much interest in exploring oscillatory correlation to solve the problems of segmentation and figure-ground segregation. One type of model uses all-to-all connections to reach synchronization (Wang, Buhmann, & von der Malsburg, 1990; Sompolinsky, Golomb, & Kleinfeld, 1991; von der Malsburg & Buhmann, 1992). As explained in section 1, these models cannot extend very far in solving the segmentation problem, because fundamental information concerning the geometry among sensory features is lost. Another type of model uses lateral connections to reach synchrony (Sporns et al., 1991; Murata & Shimizu, 1993; Schillen & König, 1994). Unfortunately, it is unclear to what extent these oscillator networks can synchronize on the basis of local connectivity, since no analysis is given and only simulation results on small networks are provided. Moreover, recent insights into the contrasting behavior of sinusoidal and relaxation oscillators make it clear that sinusoid-type oscillators, which encompass most of the oscillator models used, have severe limitations in supporting fast synchronization (Wang, 1995; Terman & Wang, 1995; Somers & Kopell, 1995). In fact, in none of the above models has anything close to a real image ever been used for testing.

3 Model Description

The building block of LEGION, a single oscillator i, is defined as a feedback loop between an excitatory unit x_i and an inhibitory unit y_i, whose time derivatives are defined as

x′_i = 3x_i − x_i³ + 2 − y_i + I_i H(p_i + exp(−αt) − θ) + S_i + ρ   (3.1a)

y′_i = ε(γ(1 + tanh(x_i/β)) − y_i).   (3.1b)
Here H stands for the Heaviside step function, which is defined as H(ν) = 1 if ν ≥ 0 and H(ν) = 0 if ν < 0. I_i represents external stimulation, which is assumed to be applied from time 0 on, and S_i denotes the coupling from other oscillators in the network. ρ denotes the amplitude of gaussian noise, the mean of which is set to −ρ. The negative mean is used to reduce the chance of self-generating oscillations, which will become clear in the next paragraph. The noise term is introduced for two purposes. The first one is obvious: to test the robustness of the system. The second one, perhaps more important, is to play an active role in separating different input patterns (for more discussion, see Terman & Wang, 1995). The parameter ε is a small, positive number. Hence equation 3.1, without any coupling or noise and with constant stimulation, corresponds to a standard relaxation oscillator.

The x-nullcline of equation 3.1 is a cubic curve, while the y-nullcline is a sigmoid function. If I > 0 and H = 1, these curves
intersect along the middle branch of the cubic when β is small. In this case, we call the oscillator enabled (see Figure 2A). It produces a stable periodic orbit, which alternates between silent and active phases of near-steady-state behavior. As shown in Figure 2A, the silent and the active phases correspond to the left L and the right R branches of the cubic, respectively. The transitions between the two phases occur rapidly (and are thus referred to as jumping). Notice that the trajectory of the oscillator in phase space jumps between the two branches and then follows the branches closely, because the small ε induces very different time scales for the x- and y-dynamics. The parameter γ is used to control the ratio of the times that the solution spends in these two phases. For a larger value of γ, the solution spends a shorter time in the active phase.

If I ≤ 0 and H = 1, the nullclines of equation 3.1 intersect at a stable fixed point along the left branch of the cubic (see Figure 2B). In this case, equation 3.1 produces no periodic orbit, and the oscillator is referred to as excitable, indicating that the oscillator has not yet been, but can be, excited by stimulation. An excitable oscillator may become oscillatory if it receives, through the term S, large enough coupling from other oscillators. Because of this dependency on external stimulation, the oscillations are stimulus dependent. We say that the oscillator is stimulated if I > 0, and unstimulated if I ≤ 0. The parameter β specifies the steepness of the sigmoid function and is chosen to be small. The oscillator model (see equation 3.1) may be interpreted as a model of the spiking behavior of a single neuron, as the envelope of a bursting neuron, or as a mean-field approximation to a network of excitatory and inhibitory binary neurons.

The primary difference between equation 3.1 and the model in Terman and Wang (1995) is the introduction of the Heaviside function, in which α > 0 and 0 < θ < 1. The parameter α is chosen to be on the same order of magnitude as ε, so that the exponential function decays on a slow time scale. It is the Heaviside term that allows the network to distinguish between major blocks and noisy fragments. The basic idea is that a major block must contain at least one oscillator, denoted as a leader, which lies in the center of a large, homogeneous region. This oscillator will be able to receive large lateral excitation from its neighborhood. A noisy fragment does not contain such an oscillator. The variable p_i in equation 3.1a determines whether an oscillator is a leader. It is referred to as the lateral potential of the oscillator i, and it satisfies the differential equation

p′_i = λ(1 − p_i) H(Σ_{k∈N(i)} T_ik H(x_k − θ_x) − θ_p) − µ p_i.   (3.2)
Here λ > 0, T_ik is the permanent connection weight (explained later) from oscillator k to i, and N(i) is called the neighborhood of i. If the weighted sum that oscillator i receives from N(i) exceeds the threshold θ_p, p_i approaches 1. If this weighted sum is below θ_p, p_i relaxes to 0 on a time scale determined by µ, which is chosen to be on the same order as ε, resulting in a slow time scale.
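For concreteness, the following sketch integrates a single uncoupled oscillator (equation 3.1 with S_i = ρ = 0 and the Heaviside term fixed at 1) by forward Euler, using the parameter values reported in section 4. The time step, duration, and initial condition are assumptions of this sketch.

```python
import numpy as np

def simulate_oscillator(I=0.2, eps=0.02, gamma=6.5, beta=0.1,
                        dt=0.005, steps=20000):
    """Forward-Euler integration of one uncoupled LEGION oscillator
    (equation 3.1 with S_i = rho = 0 and H = 1).  With I > 0 the
    trajectory settles onto the relaxation limit cycle, alternating
    between silent (left-branch) and active (right-branch) phases."""
    x, y = -2.0, 0.0                      # arbitrary initial condition
    xs = np.empty(steps)
    for t in range(steps):
        dx = 3 * x - x ** 3 + 2 - y + I   # fast excitatory dynamics
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)  # slow inhibition
        x += dt * dx
        y += dt * dy
        xs[t] = x
    return xs   # plotting xs shows the square-wave-like x activity
```

Setting I = 0 in this sketch leaves the trajectory at the stable fixed point on the left branch, the excitable regime described above.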
Figure 2: Nullclines and orbits of a single oscillator. (A) If I > 0 and H = 1, the oscillator is enabled. The periodic orbit is shown with a bold curve, and its direction of motion is indicated by the arrowheads. The left and the right branches of the x-nullcline are labeled L and R, respectively. LK and RK indicate the left and the right knees of the cubic, respectively. (B) If I ≤ 0 and H = 1, the oscillator is excitable. The fixed point PI on the left branch of the cubic is asymptotically stable.
Figure 3: Architecture of a two-dimensional LEGION network with four nearest-neighbor coupling. An oscillator is indicated by an open circle, and the global inhibitor is indicated by the filled circle.
It follows that p_i can exceed the threshold θ in equation 3.1a only if i is able to receive large enough lateral excitation from its neighborhood. In order to develop a high potential, it is not sufficient that a large number of the neighbors of i are oscillatory. They must also have a certain degree of synchrony in their oscillations. In particular, they all must exceed the threshold θ_x at the same time in their oscillations. The purpose of introducing the lateral potential is that an oscillator with a high potential can lead the activation of an oscillator block corresponding to an object. Although a high-potential oscillator need not be stimulated, it must be stimulated in order to play the role of leading an oscillator block; otherwise, the oscillator will not oscillate at all. Thus, we require that a leader always be stimulated. More formally, an oscillator i is defined as a leader if p_i ≥ θ and i is stimulated. The lateral potential of every oscillator is initialized to zero.

The network we study for image segmentation is two-dimensional. Figure 3 shows the simplest case of permanent connectivity, where an oscillator is connected only with its four immediate neighbors, except on the boundaries, where no wraparound is used. Such connectivity forms a two-dimensional grid. In general, however, N(i) should be larger, and the permanent connection weights should fall off with distance according to a gaussian distribution. The coupling term S_i in equation 3.1 is given by

S_i = Σ_{k∈N(i)} W_ik H(x_k − θ_x) − W_z H(z − θ_xz),   (3.3)
where W_ik is the dynamic connection weight from k to i. The neighborhood of the above summation is chosen to be the same as in equation 3.2. In some situations, however, the two neighborhoods should be chosen differently to achieve good results, and an alternative definition with two different neighborhoods is given elsewhere (Wang & Terman, 1996).

Now let us explain permanent and dynamic connection weights. To facilitate synchrony and desynchrony, we assume that there are two kinds of synaptic weights (links) between two oscillators, following von der Malsburg, who argued for their neurobiological plausibility (von der Malsburg, 1981; von der Malsburg & Schneider, 1986; see also Crick, 1984). The permanent weight, T_ik, embodies the hardwired structure of a network. The dynamic weight, W_ik, on the other hand, changes rapidly. W_ik is formed on the basis of T_ik according to the mechanism of dynamic normalization (Wang, 1995). Dynamic normalization was previously defined as a two-step procedure: first update the dynamic links and then normalize (Wang, 1995; Terman & Wang, 1995). There are different ways to realize such normalization. In the following, we give one way to implement dynamic normalization in differential equations:

u′_i = η(1 − u_i) I_i − ν u_i   (3.4a)

W′_ik = W_T T_ik u_i u_k − W_ik Σ_{j∈N(i)} T_ij u_i u_j.   (3.4b)
The function u_i measures whether oscillator i is stimulated, and it is initialized to 0. The parameter η determines the rate of updating u_i. When I_i > 0, u_i → 1 quickly because we choose η ≫ ν (see below); otherwise, when I_i = 0, u_i = 0. For this equation we assume I_i = 0 if oscillator i is unstimulated (otherwise it is easy to enforce this by applying a step function on I_i). The parameter ν is chosen to be on the same order as ε, so that u_i slowly relaxes back to 0 after the external stimulus is withdrawn. We assume that W_ik are initialized to 0 for all i and k. It is easy to see that if oscillator i is unstimulated, W_ik remains 0 for all k, and if oscillator k is unstimulated, W_ik = 0 for all i. Otherwise, if u_i = 1 and u_k = 1 for at least one k ∈ N(i), then at equilibrium,

W_ik = W_T T_ik u_i u_k / Σ_{j∈N(i)} T_ij u_i u_j   and   Σ_{k∈N(i)} W_ik = W_T.

Thus the total dynamic weight converging on a single oscillator equals W_T, which gives the desired normalization. Notice that dynamic weights, not permanent weights, participate in determining S_i (see equation 3.3). Moreover, W_ik can be properly set up in one step at the beginning, based on external stimulation, which should be useful for engineering applications. Weight normalization is not a necessary condition for the selective gating mechanism to work. This conclusion has been established previously (Terman & Wang, 1995).
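The one-step setup mentioned above amounts to evaluating the equilibrium expression directly; a minimal sketch (the function and argument names are ours, and W_total = 8.0 follows the value of W_T used in section 4):

```python
import numpy as np

def normalize_weights(T, stimulated, W_total=8.0):
    """One-step setup of the dynamic weights at the equilibrium of
    equation 3.4: W[i, k] = W_total * T[i, k] u_i u_k / sum_j T[i, j] u_i u_j,
    so the dynamic weights converging on each stimulated oscillator sum
    to W_total.  T is the (n, n) permanent weight matrix; `stimulated`
    is a boolean vector playing the role of u at its fixed point."""
    u = stimulated.astype(float)
    num = T * np.outer(u, u)                 # T_ik u_i u_k
    den = num.sum(axis=1, keepdims=True)     # sum_j T_ij u_i u_j
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, W_total * num / den, 0.0)
```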
With normalized weights, however, the quality of synchronization within each oscillator block is better (Terman & Wang, 1995).

In equation 3.3, W_z is the weight of inhibition from the global inhibitor z, whose activity, also denoted by z, is defined as

z′ = φ(σ_∞ − z),   (3.5)
where σ_∞ = 1 if x_i ≥ θ_zx for at least one oscillator i, and σ_∞ = 0 otherwise. Hence θ_zx represents another threshold, and it is chosen so that only an oscillator jumping to the active phase can trigger the global inhibitor. If σ_∞ equals 1, z → 1. The parameter φ represents the rate at which the inhibitor reacts to stimulation from the oscillator network.

The introduction of a lateral potential provides a solution to the problem of fragmentation. There is an initial period when the term exp(−αt) exceeds the threshold θ. During this period, every stimulated oscillator is enabled. This allows the leaders to receive sufficient lateral excitation so that they can achieve a high potential. After this initial period, the only oscillators that can jump up without stimulation from other oscillators are the leaders. When a leader jumps up, it spreads its activity to other oscillators within its own block, so they also can jump up. These oscillators are referred to as followers. Oscillators not in this block are prevented from jumping up because of the global inhibitor. The oscillators that belong to the noisy fragments will not be able to jump up beyond the initial period, because these oscillators will not be able to develop a sufficiently high potential by themselves and they cannot be recruited by leaders. These oscillators are referred to as loners. In order to be oscillatory beyond the initial time period, an oscillator must be either a leader or a follower. This indicates that the oscillator is not part of a noisy fragment, because noisy fragments in an image tend to be small and isolated (see Figure 1). The collection of all noisy regions whose corresponding oscillators are loners is called the background, which is not a uniform region and is generally discontiguous.

We have proven a number of rigorous results concerning the system of equations 3.1 to 3.5. Our main result implies that the loners will no longer be able to oscillate after an initial time period. Moreover, the asymptotic behavior of a leader or a follower is precisely the same as in the network obtained by simply removing all the loners. Together with the results in Terman and Wang (1995), this implies that after a number of oscillation cycles, a block of oscillators corresponding to a single major image region will oscillate in synchrony, while any two oscillator blocks corresponding to two major regions will desynchronize from each other. Also, the number of cycles required for full segmentation is no greater than the number of major regions plus one. The details of the analysis are given in Wang and Terman (1996).
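To make the extended model concrete, the following is a minimal forward-Euler sketch of equations 3.1 through 3.5 on a small grid, using the parameter values reported in section 4. It is an illustration of the dynamics, not the solver used in this article (a fourth-order Runge-Kutta method and LSODE; see section 4); the time step, duration, recording interval, and random seed are assumptions.

```python
import numpy as np

def legion(stim, steps=40000, dt=0.002):
    """Forward-Euler sketch of the extended LEGION equations 3.1-3.5 on
    an h x w grid with 4-nearest-neighbor permanent connections (no
    wraparound).  `stim` is a boolean (h, w) array of stimulated
    oscillators.  Parameter values follow section 4."""
    eps, alpha, beta, gamma = 0.02, 0.005, 0.1, 6.5
    theta, lam, mu = 0.9, 0.1, 0.01
    theta_x, theta_p = -0.5, 7.0
    theta_zx = 0.1                      # theta_zx = theta_xz = 0.1
    Wz, WT, phi, rho, I_on = 1.5, 8.0, 3.0, 0.02, 0.2

    h, w = stim.shape
    n = h * w
    u = stim.ravel().astype(float)      # u at its fixed point
    T = np.zeros((n, n))                # permanent weights: 2.0 to 4-neighbors
    for r in range(h):
        for c in range(w):
            for rr, cc in ((r-1, c), (r+1, c), (r, c-1), (r, c+1)):
                if 0 <= rr < h and 0 <= cc < w:
                    T[r * w + c, rr * w + cc] = 2.0
    num = T * np.outer(u, u)            # one-step normalization (eq. 3.4)
    den = num.sum(axis=1, keepdims=True)
    W = np.divide(WT * num, den, out=np.zeros_like(num), where=den > 0)

    I = I_on * u
    rng = np.random.default_rng(0)
    x = rng.uniform(-2, 2, n)           # random initial condition
    y = np.zeros(n)
    p = np.zeros(n)
    z = 0.0
    rec = []
    for t in range(steps):
        active = (x >= theta_x).astype(float)          # H(x_k - theta_x)
        S = W @ active - Wz * (z > theta_zx)           # equation 3.3
        enable = p + np.exp(-alpha * t * dt) - theta >= 0
        noise = rho * (rng.standard_normal(n) - 1.0)   # amplitude rho, mean -rho
        dx = 3 * x - x ** 3 + 2 - y + I * enable + S + noise      # eq. 3.1a
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)          # eq. 3.1b
        dp = lam * (1 - p) * ((T @ active - theta_p) >= 0) - mu * p  # eq. 3.2
        dz = phi * (float((x >= theta_zx).any()) - z)             # eq. 3.5
        x += dt * dx
        y += dt * dy
        p += dt * dp
        z += dt * dz
        if t % 50 == 0:
            rec.append(x.copy())
    return np.array(rec)                # subsampled snapshots of x
```

For a small stimulus array containing a few connected blocks plus scattered noisy squares, plotting the returned traces should qualitatively reproduce the behavior of Figures 4 and 5: synchrony within each block, desynchrony between blocks, and loners that fall silent after the initial enabling period.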
The analysis in Wang and Terman (1996) is constructive in the sense that it leads to precise estimates that the parameters must satisfy. It shows that the results hold for a robust range of parameter values. Moreover, the analysis does not depend on the precise form of the nonlinear functions in equation 3.1. The specific cubic and sigmoid functions (see Figure 2) are used because of their simplicity. In addition to the parameter description given earlier, we require that 0 < θ < 1, and that α be chosen so that all the stimulated oscillators remain enabled for the first cycle, but only leaders remain enabled during the second cycle. In equation 3.2, we simply require that λ be on the same order of magnitude as 1, and 0 < θ_p < 1. The parameter η in equation 3.4 simply needs to be on the same order as 1. There are alternative ways of defining the model without affecting its essential dynamics (Wang & Terman, 1996). In particular, we have given a definition where dynamic normalization of connection strengths in equation 3.4 is not needed, but the quality of synchrony within each block and the flexibility in choosing parameters seem somewhat lessened.

4 Computer Simulation

To illustrate how the LEGION network is used for image segmentation while eliminating fragmentation, we have simulated a 50 × 50 grid of oscillators with a global inhibitor, as defined by equations 3.1 through 3.5. We map the three objects (designated as the sun, a tree, and a mountain) in Figure 1, and then add 20% noise, so that each uncovered square has a 20% chance of being covered (stimulated). The resulting image is shown in Figure 4A. In the simulation, N(i) is the four nearest neighbors, without boundary wraparound. For all the stimulated oscillators I = 0.2, while for the others I = 0. Notice that if oscillator i is unstimulated, W_ik = W_ki = 0 for all k, and I_i does not need to be negative to prevent i from oscillating. The amplitude ρ of the gaussian noise was set to 0.02. This represents a 10 percent noise level compared to the external stimulation. We observed during the simulations that noise facilitated the process of desynchronization. The differential equations 3.1 through 3.5 were solved using both a fourth-order Runge-Kutta method and the adaptive-grid ordinary differential equation solver LSODE.

Permanent connections between any two neighboring oscillators were set to 2.0, and for the total dynamic connections (see equation 3.4b), W_T = 8.0. Dynamic weights W_ik were set up at the beginning according to equation 3.4. The following values for the other parameters in equations 3.1 through 3.5 were used: ε = 0.02, α = 0.005, β = 0.1, γ = 6.5, θ = 0.9, λ = 0.1, θ_x = −0.5, θ_p = 7.0, W_z = 1.5, η = 1.0, µ = ν = 0.01, φ = 3.0, and θ_zx = θ_xz = 0.1. The value of θ_p was chosen so that in order to achieve a high potential, an oscillator must have all four neighbors active. The simulation results were robust to considerable changes in the parameter values.

Figures 4B–E show the instantaneous activity (snapshots) of the network at various stages of dynamic evolution.
Figure 4: (A) An image composed of three patterns on a noisy background. The image is mapped to a 50 × 50 LEGION network. Each square corresponds to an oscillator. If a square is entirely covered, the corresponding oscillator receives external input; otherwise, the oscillator receives no external input. In the figure, B–E correspond to the case with the inclusion of the lateral potential, whereas F–J correspond to the case without the lateral potential. (B) A snapshot at the beginning of dynamic evolution. (C–E) Snapshots subsequently taken shortly after B. (F) A snapshot at the beginning of dynamic evolution for the case without the lateral potential. (G–J) Snapshots subsequently taken shortly after F.
The diameter of each black circle represents the x activity of the corresponding oscillator. Specifically, if the range of x values over all the oscillators is given by xmin and xmax, then the diameter of the black circle for an oscillator with activity x is proportional to (x − xmin)/(xmax − xmin). Figure 4B shows the snapshot at the beginning of the dynamic evolution, included to illustrate the random initial conditions. Figure 4C shows a snapshot taken shortly after Figure 4B. The effect of synchrony and desynchrony is clear: all the stimulated oscillators that belong to the sun or are its neighbors are entrained and have large activities (they are in the active phase). At the same time, the oscillators stimulated by the rest of the image have very small activities (they are in the silent phase). Thus the noisy sun is segmented from the rest of the image. A short time later, as shown in Figure 4D, the oscillators in the group representing the noisy tree reach their active phase and are separated from the rest of the image. Figure 4E shows another snapshot, in which the noisy mountain has its turn to be activated and separated from the rest of the input. This successive "pop-out" of the segments continues in a stable periodic fashion until the input image is withdrawn.

To illustrate the entire segmentation process, Figure 5 shows the temporal evolution of every stimulated oscillator. The activities of the oscillators stimulated by each noisy object are combined into one trace in the figure, as are those of the background, consisting of all the scattered dots. Since the oscillators receiving no external stimulation remain excitable but unable to oscillate throughout the simulation, they are excluded from the display. The three upper traces represent the activities of the three oscillator blocks, and the fourth represents the background. Because of their low potentials, these background oscillators quickly become excitable, even though they are enabled at the beginning. The bottom trace represents the activity of the global inhibitor. The synchrony within each block and the desynchrony between different blocks are clearly established after three cycles.

To illustrate the role of the lateral potential, the same network with the same input (see Figure 4A) and the same initial condition was simulated without the lateral potential. In this case, the Heaviside function in equation 3.1a is always 1 for every oscillator. With the random initial condition shown in Figure 4F, the network reaches a stable oscillatory behavior with four segments after fewer than three cycles. The four segments are shown in Figures 4G–J. Without the lateral potential, the network cannot distinguish the major image regions from the noisy fragments, nor can it separate the major regions from one another.

With a fixed set of parameters, the dynamical system of LEGION can segment only a limited number of patterns. This number depends to a large extent on the ratio of the times that a single oscillator spends in the silent and active phases. Let us refer to this limit as the segmentation capacity of LEGION. In the above simulation, the number of major blocks to be segmented is within the segmentation capacity. What happens if this number exceeds the segmentation capacity?
Figure 5: Temporal evolution of every stimulated oscillator. The upper three traces show the combined x activities of the three oscillator blocks representing the three corresponding patterns indicated by their respective labels. The fourth trace shows the temporal activities of the loners, and the bottom trace shows the activity of the global inhibitor. The ordinates indicate the normalized activity of an oscillator or the inhibitor. The simulation took 9000 integration steps.
From the analysis in Wang and Terman (1996), we know that the system will separate the entire image into as many segments as the capacity allows, where each segment may correspond to one major block (called a simple segment) or to a number of major blocks (called a congregate segment). To illustrate this point, we present to a 30 × 30 LEGION network an arbitrary image containing nine binary patterns, which together form the phrase OHIO STATE, as shown in Figure 6A. We then add 10 percent random noise to the input in the same way as in Figure 4, resulting in Figure 6B. We use the same parameter values as in the simulations of Figure 4, except that γ = 8.0. For this set of parameters, our earlier experiments showed that the system's segmentation capacity is less than nine. The simulation results are presented in Figures 6C–H. Shortly after the start of system evolution, the LEGION network segmented the input of Figure 6B into five segments, shown in Figures 6D–H, respectively. Among these five segments, three are simple segments (6D, 6E, and 6H) and two are congregate segments (6F and 6G). Besides the run shown in Figure 6, many other simulations have been performed on the input of Figure 6B with different random initial conditions, and the results are comparable with Figure 6. There are different ways, however, in which the system separates the nine noisy patterns into five segments.
Figure 6: (A) An image composed of nine patterns mapped to a 30 × 30 LEGION network. See the legend of Figure 4 for explanations. (B) The image in A is corrupted by 10 percent noise. (C) A snapshot of network activity at the beginning of dynamic evolution. (D–H) Snapshots subsequently taken shortly after C.
For this particular set of parameters, the segmentation capacity of the LEGION network is five; in fact, we have not seen a single simulation trial in which more than five segments are produced. This important property of the system, that it naturally exhibits a segmentation capacity, is in good accord with the well-known psychological principle that there are fundamental limits on the number of simultaneously perceived objects.
5 Real Images

LEGION can segment gray-level images in a way similar to segmenting binary images. For a given image, a LEGION network of the same size as the image, with a global inhibitor, is used to perform segmentation. Each pixel of the image corresponds to an oscillator of the network, and we assume that every oscillator is stimulated when the image is applied to the network. The main difference between gray-level and binary images lies in how the connections are set up. For gray-level images, the coupling strength between two neighboring oscillators is determined by the similarity of the two corresponding pixels. This simple way of setting up the coupling strength addresses only the grouping principles of proximity, connectedness, and similarity (cf. section 1).

5.1 Algorithm. Segmenting real images with large numbers of pixels involves integrating a large number of differential equations (equations 3.1–3.5). To reduce the numerical computation on a serial computer, an algorithm is extracted from these equations. The algorithm follows the major steps in the numerical simulation of the equations, and it exhibits the essential properties of relaxation oscillators, such as two time scales (fast and slow) and synchrony and desynchrony in a population of oscillators. Such extraction is quite straightforward because, in a relaxation oscillator network, much of the dynamics takes place when oscillators are jumping up or jumping down. In addition, the algorithm overcomes the segmentation capacity limit, which may be desirable in some applications. More specifically, the following approximations have been made:

1. When no oscillator is in the active phase (see Figure 2), the leader closest to the jumping point (the left knee) among all enabled oscillators is selected to jump up to the active phase.

2. An oscillator takes one time step to jump up to the active phase if the net input it receives from neighboring oscillators and the global inhibitor is positive.

3. The alternation between the active phase and the silent phase of a single oscillator takes only one time step.

4. All of the oscillators in the active phase jump down if no more oscillators can jump up. This situation occurs when the oscillators stimulated by the same pattern have all jumped up.

LEGION algorithm. Only the x value of oscillator i, xi, is used in the algorithm. N(i) is assumed to be the set of eight nearest neighbors of i, without wraparound. LKx, RKx, and LCx represent the x values of three of the four corner points of a typical limit cycle (see Figure 2A), where LC denotes the (upper) left corner of the limit cycle. By straightforward calculation, we obtain
LKx = −1, LCx = −2, RKx = 1. In the algorithm, Ii indicates the value of pixel i, and IM indicates the maximum possible pixel value.

1. Initialize.
   1.1 Set z(0) = 0.
   1.2 Form effective connections:

       Wik = IM/(1 + |Ii − Ik|),   k ∈ N(i).

   1.3 Find leaders:

       pi = H[ Σk∈N(i) Wik − θp ].

   1.4 Place all the oscillators randomly on the left branch; that is, xi(0) takes a random value between LCx and LKx.

2. Find one oscillator j such that (1) xj(t) ≥ xk(t), where k ranges over the leaders (pk = 1) currently on the left branch, and (2) pj = 1. Then set

       xj(t + 1) = RKx;  z(t + 1) = 1   {jump up}
       xk(t + 1) = xk(t) + (LKx − xj(t)),   for k ≠ j and pk = 1.

   In this step, the leader on the left branch closest to the left knee is selected. This leader jumps up to the right branch, and all the other leaders move toward LK.

3. Iterate until stop:

       if (xi(t) = RKx and z(t) > z(t − 1))   {stay on the right branch}
           xi(t + 1) = xi(t)
       else if (xi(t) = RKx and z(t) ≤ z(t − 1))   {jump down}
           xi(t + 1) = LCx;  z(t + 1) = z(t) − 1
           if (z(t + 1) = 0) go to step 2
       else
           Si(t + 1) = Σk∈N(i) Wik H(xk(t)) − Wz H(z(t) − 0.5)
           if (Si(t + 1) > 0)   {jump up}
               xi(t + 1) = RKx;  z(t + 1) = z(t) + 1
           else   {stay on the left branch}
               xi(t + 1) = xi(t)
In comparing this algorithm with the dynamical system of equations 3.1 through 3.5, one can identify the following simplifications:

1. The dynamic weight Wik is set directly to Wik = IM/(1 + |Ii − Ik|).
The intuitive reason for this choice of weights is that the more similar pixels i and k are to each other, the stronger the connection between the two corresponding oscillators. It is worth noting that the algorithm does not compute normalized weights. As mentioned in section 3, selective gating can still take place with this weight setting, even though the weights are not normalized.

2. The leaders are chosen during initialization. According to the dynamics described in section 3, lateral potentials, and thus leaders, are determined during a few initial cycles of oscillatory dynamics. Since every oscillator is stimulated and Wik is fixed at the outset, it can be predicted precisely which oscillators will become leaders. Thus, to save computation time, the leaders are determined in the initialization step. It should be clear that the number of leaders determined in this step does not correspond to the number of resulting segments; a major image region (segment) may generate many leaders.

There are two critical parameters in the algorithm: Wz and θp, where the former is the strength of global inhibition and the latter is the threshold for forming high potentials (leaders). For Wz, higher values make it more difficult for the algorithm to group pixels into regions; in order for a region to be grouped together, the algorithm demands a higher degree of homogeneity within the region. Generally, given a gray-level image, a higher Wz leads to more and smaller regions. For θp, higher values make it more difficult to develop leaders; fewer leaders will be developed, and fewer regions will result from the algorithm. On the other hand, regions produced with a higher θp tend to be more homogeneous. For image segmentation applications, it suffices to stop the algorithm when every leader has jumped up once. (See Wang & Terman, 1996, for further discussion of the algorithm.)

5.2 Segmenting Real Images.

5.2.1 Summation versus Maximization: An Aerial Image. The first image the algorithm is tested on is an aerial image, called Lake, shown in Figure 7A. Like the other images used below, this is a typical gray-level image: each pixel is an 8-bit number ranging from 0 to 255 (also called intensity), and pixels with higher values appear brighter. The image has 160 × 160 pixels and is presented to a LEGION network of 160 × 160 oscillators. For this simulation, Wz = 40 and θp = 1200. Quickly after the image is presented, the algorithm produces different segments at different time steps. Figures 7B–G display the first six segments produced, in sequence; a black pixel corresponds to an oscillator in the active phase, and a blank pixel corresponds to an oscillator in the silent phase. Each segment in the figure corresponds to a meaningful region in the original image: a segment is a lake, a field, or a parkway. The region in Figure 7B corresponds to a lake.
Figure 7: (A) A gray-level image consisting of 160 × 160 pixels (courtesy of K. Boyer). (B–G) Segments popped out subsequently from the network shortly after the LEGION algorithm is executed.
The region in Figure 7C corresponds to the main lake, except for the lower-left part and an area on the right side, where the segment extends into nonlake parts. The parkway segment in Figure 7G picks up part of the parkway network in the original image. The other segments match well with the fields of the image. The entire image is separated into 16 regions and a background. To simplify the display, we put all the segments and the background together into one figure, using gray levels to indicate the phases of the oscillator blocks. Such a display is called a gray map. The gray map of the results of this simulation is shown in Figure 8A, where the background is shown by the black scattered areas.
Figure 8: (A) A gray map showing the result of segmenting Figure 7A. The algorithm produces 16 segments plus a background. (B) The result of another segmentation using maximization to compute Si . The algorithm produces 17 segments plus a background. (C) The result of another segmentation similar to B but with a different value for θp . The system produces 23 segments plus a background. The algorithm was run for 1000 steps for every case.
Generally, the background corresponds to parts of the image that have high intensity variations. Owing to the mechanism of fragment removal, these noisy regions stay in the background, rather than yielding the many small segments that would result without fragment removal.

In the above algorithm, an oscillator summates all the input from its neighborhood, and if the overall input is greater than the global inhibition, the oscillator jumps to the active phase (see also equation 3.3). Another reasonable way of grouping is to replace summation by maximization when computing Si:
    Si = max_{k∈N(i)} {Wik H(xk − θx)} − Wz H(z − θxz).   (5.1)
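As an aside, the two coupling rules are easy to compare in code. The sketch below is an illustrative fragment, not the authors' implementation: W[i] is assumed to map each neighbor k ∈ N(i) to the effective weight Wik, active[k] stands for the Heaviside test on xk, and the inhibition thresholds (0.5 in the algorithm, θxz in equation 5.1) are folded into the boolean argument inhibited.

    def coupling_sum(i, W, active, inhibited, Wz):
        # Summation rule: S_i = sum_k W_ik H(x_k) - Wz H(z - 0.5).
        return sum(w for k, w in W[i].items() if active[k]) \
            - (Wz if inhibited else 0.0)

    def coupling_max(i, W, active, inhibited, Wz):
        # Maximization rule (equation 5.1): only the strongest active
        # neighbor of i contributes to the excitation.
        best = max((w for k, w in W[i].items() if active[k]), default=0.0)
        return best - (Wz if inhibited else 0.0)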
Intuitively, the maximum operation concentrates on the relation between i and the oscillator in N(i) that has the strongest coupling with i, but it ignores the relation between i and N(i) as a whole. Thus, grouping by maximization emphasizes pairwise pixel relations, whereas grouping by summation emphasizes pixel relations in a local field. Using equation 5.1 in the LEGION algorithm, the Lake image is segmented again. In this simulation, Wz = 20 and θp = 1200. Figure 8B shows the result of segmentation as a gray map. The entire image is segmented into 17 regions and a background, which is indicated by the black areas. Each segment corresponds well to a relatively homogeneous region in the image. Interestingly, except for the parkway region in the lower part of the image, every region in Figure 8B has a corresponding one in Figure 8A.

A comparison between the two figures reveals the difference between summation and maximization in segmentation. A closer comparison indicates that the maximum scheme yields slightly more faithful regions. On the other hand, regions in Figure 8A appear smoother and have fewer "black holes" (parts of the background). The smoothing effect of summation is generally positive, but it may lose important details. For example, the sizable hole inside the main lake region of Figure 8B corresponds to an island in the original image, which is missed in the main lake region of Figure 8A. Another distinction is that grouping in the maximum scheme is symmetrical, in the sense that if pixel a can recruit pixel b, then b can recruit a as well. This is because the effective weights are symmetrical, namely, Wik = Wki (see the algorithm). Because the maximum scheme appears to produce better results, it is used in all of the following simulations.

To show the effects of the parameters, we reduce the value of θp from 1200 in Figure 8B to 1000. As a result, more regions are segmented, as shown in Figure 8C, where 23 segments plus a background are produced by the algorithm. Compared with Figure 8B, the notable new segments include an open-theater-like region to the left of the main lake and a nearby field region. The Lake image has also been used by Sarkar and Boyer (1993a), whose approach is edge based; readers are encouraged to compare our results with theirs.

5.2.2 MRI Images. The next image used to test our algorithm is an MRI (magnetic resonance imaging) image of a human head (see Figure 9A). MRI images constitute a large class of medical images, and their automatic processing is of great practical value. This particular image, which we denote Brain-1, is a midsagittal section consisting of 257 × 257 pixels.
Salient regions of this picture are the cerebral cortex, the cerebellum, the brainstem, the corpus callosum, the fornix (the bright stripe below the corpus callosum), the septum pellucidum (the region surrounded by the corpus callosum and the fornix), the extracranial soft tissue (the bright stripe on top of the head), the bone marrow (scattered stripes under the extracranial tissue), and several other structures (for the nomenclature, see Kandel, Schwartz, & Jessell, 1991, p. 318).

For this image, a LEGION network of 257 × 257 oscillators is used, with Wz = 25 and θp = 800. Figure 9B shows the result of one simulation as a gray map. The Brain-1 image is segmented into 21 regions plus a background, indicated by the black areas. Of particular interest are two parts of the brain: (1) the upper part and (2) the brainstem together with part of the spinal cord ("brainstem" for short). Other segments include parts of the extracranial tissue and parts of the bone marrow, as well as the neck, the chin, the nose, and a vertebral segment.

Although it is useful to treat the brain as a whole in some circumstances, it may also be desirable to segment the brain into more detailed structures. To achieve this in LEGION, one can increase Wz; but in order not to produce too many regions, θp should usually be increased as Wz increases. To show the combined effects, Figure 9C displays the result of another run, with Wz = 40 and θp = 1000, in which Brain-1 is segmented into 25 regions plus a background. Now the upper part of the brain is further segmented into the cerebral cortex, the cerebellum, the callosum-fornix region, and the surrounding septum. Because of the higher Wz, regions in Figure 9C tend to contain more background (compare, for example, the two brainstem regions). However, in the cortex segment of Figure 9C, the noisy stripes actually have physical meaning; they tend to match various fissures on the cerebral cortex. With the higher θp, some segments in Figure 9B cannot generate any leaders and thus join the background.

The final segmentation uses another MRI image of a human head, shown in Figure 10A. This image, denoted Brain-2, consists of 257 × 257 gray-level pixels and is a sagittal section through one eye. Salient regions of this picture are the cortex, the cerebellum, the lateral ventricle (the black hole within the cortex), the eye, the sinus (the black hole below the eye), the extracranial soft tissue, and the bone marrow. A LEGION network with 257 × 257 oscillators is used for the segmentation task. In the first simulation, Wz = 20 and θp = 800; Figure 10B shows the result as a gray map. Brain-2 is segmented into 17 regions plus a background, indicated by the black scattered areas. One can see from the figure that the entire brain forms a single segment. Other significant segments are the eye, the sinus, parts of the bone marrow, and parts of the extracranial tissue. The lateral ventricle is put into the background.

In order to generate finer structures, Wz is raised to 35 in the second simulation; as in Figure 9, θp is increased to 1000. Figure 10C shows the result of this simulation, in which Brain-2 is segmented into 13 regions plus a background.
Figure 9: (A) A gray-level image consisting of 257 × 257 pixels (courtesy of N. Shareef). (B) A gray map showing the result of segmenting the image by a 257 × 257 LEGION network. The system produces 21 segments plus a background. (C) The result of another segmentation with different values of Wz and θp . The system produces 25 segments plus a background. The algorithm was run for 1200 steps in both B and C.
As expected, the segments in Figure 10B become further segmented or shrunken, and the background becomes more extensive. It is worth mentioning that the brain segment of Figure 10B is now divided into three segments: one corresponding to the cortex and the other two corresponding to the cerebellum. Because of the increase in θp, the segments corresponding to the extracranial tissue and the marrow in Figure 10B disappear in Figure 10C.
Figure 10: (A) A gray-level image consisting of 257 × 257 pixels (courtesy of N. Shareef). (B) A gray map showing the result of segmenting the image by a 257 × 257 LEGION network. The system produces 17 segments plus a background. (C) The result of another segmentation with different values of Wz and θp . The system produces 13 segments plus a background. The algorithm was run for 1200 steps in both B and C.
In the segmentation experiments of this section, our goal was to illustrate the LEGION mechanism and the potential effectiveness of derived algorithms. We did not attempt to produce the best possible results by fine-tuning the parameters; one can see this from the simple rule for setting Wik, the simple choice of N(i), and the values of Wz and θp used in the simulations.
Better results can therefore be expected from more sophisticated schemes for choosing these parameters. Indeed, a more elaborate version of the LEGION algorithm has been applied to segment three-dimensional MRI and CT (computerized tomography) images, and good segmentation results have been obtained (Shareef et al., in preparation).

6 Discussion

6.1 Further Remarks on LEGION Computation. In the simulations of section 5, N(i) is set to the eight nearest neighbors of i. A larger N(i) entails more computation for determining leaders, but it also offers more flexibility in specifying the conditions for creating leaders, which tends to produce better results. Also, the order in which different segments pop out is currently random. In some situations, however, it might be useful to influence the pop-out order by some criterion, such as the size of each region. LEGION may incorporate different criteria by ordering leaders accordingly, for example, by using the global inhibitor in different ways.

The main difference between our approach to image segmentation and the other segmentation algorithms reviewed in section 2 is that ours is neurocomputational, building on the strengths and the constraints of neural computation. Our approach relies on the emergent behavior of LEGION to embody the computation involved in image segmentation. Methodologically, the computation is performed by a population of active agents (oscillators, in this case) that are driven by pixels, whereas in a typical segmentation algorithm, pixels are data to be processed by a central agent: the algorithm or a neural network trained as a pixel classifier. That LEGION is a massively parallel network of dynamical systems with mainly local coupling makes it particularly suitable for analog very large scale integration (VLSI) implementation, the success of which would be a major step toward real-time scene segmentation.

A thorny issue with scene segmentation is that often no unique answer exists. A house, for example, may be grouped into a single segment if viewed from afar; the same house, viewed up close, may be broken into multiple segments, including a door, a roof, windows, and other elements. This situation demands a flexible treatment of scene segmentation: a system should be able to generate multiple segmentations easily, each of which is reasonable. In LEGION, this flexibility is reflected to a certain degree by the effects of the parameters Wz and θp, as discussed in section 5. As noted there, both Figure 9B and Figure 9C offer arguably reasonable results.

The LEGION network used so far has only one layer of oscillators. In section 4, we noted that the oscillatory dynamics of one-layer LEGION has a limited segmentation capacity (see Figure 6). It is interesting to note that the human perceptual system is also limited in the number of objects it can simultaneously attend to in a scene (Miller, 1956).
We expect that the ability of LEGION improves significantly when multiple layers are used in subsequent stages. Such multistage processing provides a natural way out of this fundamental limitation: each layer need not segregate more than several segments, and yet the system as a whole can segregate many more segments than the segmentation capacity, an idea reminiscent of the chunking proposed by Miller (1956). When the number of segments in an image exceeds the segmentation capacity, one-layer LEGION will produce a number of segments (simple or congregate) up to the segmentation capacity (see Figure 6). Congregate segments, however, can be further segmented by another layer of LEGION, whereas simple segments will not segment further. With multistage processing, the hierarchical system can provide results of both coarse- and fine-grain segmentation.

6.2 Biological Relevance. The relaxation-type oscillator used in LEGION is dynamically very similar to numerous other oscillators used in modeling neuronal behavior. Examples include the FitzHugh-Nagumo equations (FitzHugh, 1961; Nagumo, Arimoto, & Yoshizawa, 1962) and the Morris-Lecar model (Morris & Lecar, 1981), all of which can be viewed as simplifications of the Hodgkin-Huxley equations (Hodgkin & Huxley, 1952). In equation 3.1, the variable x corresponds to the membrane potential of the neuron, and y corresponds to the channel activation or inactivation state variable that evolves on the slowest time scale. The reduction from a full Hodgkin-Huxley model to the two-variable model is achieved by assuming that the other, faster, channel state variables are instantaneous. The dynamics of the lateral potential as given in equation 3.2 has properties similar to those of certain membrane channels and excitatory chemical synapses. The N-methyl-D-aspartate channel, for example, turns off on a slow time scale (Traub & Miles, 1991); moreover, with a sufficiently large input, a cell with these channels can be transformed from the excitable to the oscillatory mode. We note that the lateral potential does not act as a temporal integrator of all the input converging on its corresponding oscillator but instead relies on a sharp nonlinearity, embodied by the outer Heaviside function in equation 3.2.

The theory of oscillatory correlation is consistent with the growing body of evidence supporting the existence of neural oscillations in the visual cortex and other brain regions. In the visual system, synchronous oscillations have been observed in cell recordings of the cat visual cortex (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989). These neural oscillations are stimulus dependent and range from 30 to 70 Hz, often referred to as 40-Hz oscillations. Moreover, synchronous oscillations (locking with zero phase lag) occur across an extended brain region only if the stimulus constitutes a coherent object. These basic findings have been confirmed repeatedly in different brain regions and in different animal species (for reviews, see Buzsáki, Llinás, Singer, Berthoz, & Christen, 1994; and Singer & Gray, 1995).

The local excitatory connections assumed in LEGION conform with various lateral connections in the brain.
In relation to the visual cortex, these excitatory connections, which link the excitatory elements of oscillators, can be interpreted as the horizontal connections in the visual cortex (Gilbert & Wiesel, 1989; Gilbert, 1992). It is known that horizontal connections originate from pyramidal cells, which are excitatory, and that pyramidal cells are also the principal targets of the horizontal connections. Furthermore, at the functional level, physiological recordings from monkeys suggest that motion-based visual segmentation may be processed in the primary visual cortex (Stoner & Albright, 1992; Lamme, van Dijk, & Spekreijse, 1993).

The global inhibitor (see Figure 3) receives input from the entire oscillator network and feeds inhibition back onto the network. It serves to segment the multiple patterns simultaneously present in a visual scene, thus exerting global coordination. Crick (1984) has suggested that part of the thalamus, the thalamic reticular complex in particular, may be involved in the global control of selective attention. The thalamus is uniquely located in the brain: it receives input from, and sends projections to, almost the entire cortex. This suggestion, together with key anatomical and physiological properties of the thalamus, prompts us to speculate that the global inhibitor might correspond to a neuronal group in the thalamus, with the activity of the global inhibitor interpreted as the collective behavior of that neuronal group.

6.3 Figure-Ground Segregation. The dynamics proposed in this article separates a scene into a number of major segments and a background, which corresponds to the rest of the scene. The major segments combine to form the foreground, whose corresponding oscillators remain oscillatory until the input scene fades away. After a brief initial period, the oscillators corresponding to the background become excitable and stop oscillating. This dynamics effectively eliminates noisy fragments without either prior smoothing or postprocessing to remove small regions, the methods often used in segmentation algorithms. With this dynamics, typical figure-ground segregation can be characterized as a special case in which only one major segment is allowed to separate from the scene. In this sense, we claim that our dynamics also provides a potential solution to the problem of figure-ground segregation. We allow a foreground to include multiple segments so that both scene segmentation and figure-ground segregation are incorporated into a unified framework.

6.4 Future Topics. In this study, we have not addressed the role of prior knowledge in image segmentation. For example, when people segment the images of Figures 9A and 10A, they inevitably use their knowledge of human anatomy, which describes, among other things, the relative size and position of major brain regions. A more complete system of image segmentation must address this issue. Beyond prior knowledge, many of the grouping principles outlined in section 1 have not been incorporated into the system, and one of the main future topics is to incorporate more grouping cues. The global inhibitory mechanism will play a key role in overall system coordination: it makes various factors compete with each other, and a final segment forms because of strong binding within the segment.
Our study in this article focuses exclusively on visual segmentation. It should be noted that neural oscillations occur in other modalities as well, including audition (Galambos, Makeig, & Talmachoff, 1981; Ribary et al., 1991) and olfaction (Freeman, 1978). Strikingly, the oscillations in these different modalities show comparable frequencies. A recent study extended LEGION to deal with auditory scene segregation (Wang, 1996). With its computational properties and its biological relevance, the oscillatory correlation approach promises to provide a general neurocomputational theory for scene segmentation and perceptual organization.

Acknowledgments

We thank the two anonymous referees for their constructive comments. D.L.W. was supported in part by ONR grant N00014-93-1-0335, NSF grant IRI-9423312, and ONR YIP Award N00014-96-1-0676. D.T. was supported in part by NSF grant DMS-9423796. We thank S. Campbell for preparing Figure 2 and E. Cesmeli for his assistance in preparing Figure 7.

References

Abeles, M. (1982). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Adams, R., & Bischof, L. (1994). Seeded region growing. IEEE Trans. Pattern Anal. Machine Intell., 16, 641–647.
Barlow, H. B. (1972). Single units and cognition: A neurone doctrine for perceptual psychology. Percept., 1, 371–394.
Buzsáki, G., Llinás, R., Singer, W., Berthoz, A., & Christen, Y. (Eds.). (1994). Temporal coding in the brain. Berlin: Springer-Verlag.
Crick, F. (1984). Function of the thalamic reticular complex: The searchlight hypothesis. Proc. Natl. Acad. Sci. USA, 81, 4586–4590.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex. Biol. Cybern., 60, 121–130.
FitzHugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J., 1, 445–466.
Foresti, G., Murino, V., Regazzoni, C. S., & Vernazza, G. (1994). Grouping of rectilinear segments by the labeled Hough transform. CVGIP: Image Understanding, 58(3), 22–42.
Freeman, W. J. (1978). Spatial properties of an EEG event in the olfactory bulb and cortex. Electroencephalogr. Clin. Neurophysiol., 44, 586–605.
Galambos, R., Makeig, S., & Talmachoff, P. J. (1981). A 40-Hz auditory potential recorded from the human scalp. Proc. Natl. Acad. Sci. USA, 78, 2643–2647.
Geman, D., Geman, S., Graffigne, C., & Dong, P. (1990). Boundary detection by constrained optimization. IEEE Trans. Pattern Anal. Machine Intell., 12, 609–628.
Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9, 1–13.
Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9, 2432–2442.
Gove, A., Grossberg, S., & Mingolla, E. (in press). Brightness perception, illusory contours, and corticogeniculate feedback. Visual Neurosci.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent computations. Percept. Psychophys., 38, 141–171.
Grossberg, S., & Wyse, L. (1991). A neural network architecture for figure-ground separation of connected scenic figures. Neural Net., 4, 723–742.
Gurari, E. M., & Wechsler, H. (1982). On the difficulties involved in the segmentation of pictures. IEEE Trans. Pattern Anal. Machine Intell., 4(3), 304–306.
Haralick, R. M. (1979). Statistical and structural approaches to texture. Proc. IEEE, 67, 786–804.
Haralick, R. M., & Shapiro, L. G. (1985). Image segmentation techniques. Comput. Graphics Image Process., 29, 100–132.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Horowitz, S. L., & Pavlidis, T. (1976). Picture segmentation by a tree traversal algorithm. J. ACM, 23, 368–388.
Kandel, E. R., Schwartz, J. H., & Jessell, T. M. (1991). Principles of neural science (3rd ed.). New York: Elsevier.
Koffka, K. (1935). Principles of Gestalt psychology. New York: Harcourt.
Koh, J., Suk, M., & Bhandarkar, S. M. (1995). A multilayer self-organizing feature map for range image segmentation. Neural Net., 8, 67–86.
Kohler, R. (1981). A segmentation system based on thresholding. Comput. Graphics Image Process., 15, 319–338.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Lamme, V. A. F., van Dijk, B. W., & Spekreijse, H. (1993). Contour from motion processing occurs in primary visual cortex. Nature, 363, 541–543.
Liou, S. P., Chiu, A. H., & Jain, R. C. (1991). A parallel technique for signal-level perceptual organization. IEEE Trans. Pattern Anal. Machine Intell., 13, 317–325.
Manjunath, B. S., & Chellappa, R. (1993). A unified approach to boundary perception: Edges, textures, and illusory contours. IEEE Trans. Neural Net., 4, 96–108.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychol. Rev., 63, 81–97.
Milner, P. M. (1974). A model for visual shape recognition. Psychol. Rev., 81(6), 521–535.
Mohan, R., & Nevatia, R. (1992). Perceptual organization for scene segmentation and description. IEEE Trans. Pattern Anal. Machine Intell., 14, 616–635.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Mozer, M. C., Zemel, R. S., Behrmann, M., & Williams, C. K. I. (1992). Learning to segment images using dynamic feature binding. Neural Comp., 4, 650–665.
Murata, T., & Shimizu, H. (1993). Oscillatory binocular system and temporal segmentation of stereoscopic depth surfaces. Biol. Cybern., 68, 381–390.
Nagumo, J., Arimoto, S., & Yoshizawa, S. (1962). An active pulse transmission line simulating nerve axon. Proc. IRE, 50, 2061–2070.
Ribary, U., Ioannides, A. A., Singh, K. D., Hasson, R., Bolton, J. P. R., Lado, F., Mogilner, A., & Llinás, R. (1991). Magnetic field tomography of coherent thalamocortical 40-Hz oscillations in humans. Proc. Natl. Acad. Sci. USA, 88, 11037–11041.
Rock, I., & Palmer, S. (1990). The legacy of Gestalt psychology. Sci. Am., 263, 84–90.
Sarkar, S., & Boyer, K. L. (1993a). Integration, inference, and management of spatial information using Bayesian networks: Perceptual organization. IEEE Trans. Pattern Anal. Machine Intell., 15, 256–274.
Sarkar, S., & Boyer, K. L. (1993b). Perceptual organization in computer vision: A review and a proposal for a classificatory structure. IEEE Trans. Syst. Man Cybern., 23, 382–399.
Schalkoff, R. (1992). Pattern recognition: Statistical, structural and neural approaches. New York: Wiley.
Schillen, T. B., & König, P. (1994). Binding by temporal structure in multiple feature domains of an oscillatory neuronal network. Biol. Cybern., 70, 397–405.
Sejnowski, T. J., & Hinton, G. E. (1987). Separating figure from ground with a Boltzmann machine. In M. A. Arbib & A. R. Hanson (Eds.), Vision, brain, and cooperative computation (pp. 703–724). Cambridge, MA: MIT Press.
Shareef, N., Wang, D. L., & Yagel, R. (in preparation). Segmentation of medical images using LEGION.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Ann. Rev. Physiol., 55, 349–374.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586.
Somers, D., & Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D, 89, 169–183.
Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1991). Cooperative dynamics in visual processing. Phys. Rev. A, 43, 6990–7011.
Sporns, O., Tononi, G., & Edelman, G. M. (1991). Modeling perceptual grouping and figure-ground segregation by means of active re-entrant connections. Proc. Natl. Acad. Sci. USA, 88, 129–133.
Stoner, G. R., & Albright, T. D. (1992). Neural correlates of perceptual motion coherence. Nature, 358, 412–414.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
Traub, R. D., & Miles, R. (1991). Neuronal networks in the hippocampus. New York: Cambridge University Press.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Tübingen: Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biol. Cybern., 67, 233–242.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Wang, D. L. (1993a). Modeling global synchrony in the visual cortex by locally coupled neural oscillators. In Proc. of 15th Ann. Conf. Cognit. Sci. Soc. (pp. 1058–1063). Boulder, CO.
Wang, D. L. (1993b, August). Pattern recognition: Neural networks in perspective. IEEE Expert, 8, 52–60.
Wang, D. L. (1995). Emergent synchrony in locally coupled neural oscillators. IEEE Trans. Neural Net., 6(4), 941–948.
Wang, D. L. (1996). Primitive auditory segregation based on oscillatory correlation. Cognit. Sci., 20, 409–456.
Wang, D. L., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Comp., 2, 95–107.
Wang, D. L., & Terman, D. (1995a). Locally excitatory globally inhibitory oscillator networks. IEEE Trans. Neural Net., 6(1), 283–286.
Wang, D. L., & Terman, D. (1995b). Image segmentation based on oscillatory correlation. In Proc. of World Congress on Neural Net. (pp. II.521–525).
Wang, D. L., & Terman, D. (1996). Image segmentation based on oscillatory correlation (Tech. Rep. No. 19). Columbus, OH: Ohio State University Center for Cognitive Science.
Wertheimer, M. (1923). Untersuchungen zur Lehre von der Gestalt, II. Psychol. Forsch., 4, 301–350.
Zucker, S. W. (1976). Region growing: Childhood and adolescence. Comput. Graphics Image Process., 5, 382–399.
Received April 1, 1996; accepted August 20, 1996.
Communicated by Steven Zucker
Stochastic Completion Fields: A Neural Model of Illusory Contour Shape and Salience

Lance R. Williams and David W. Jacobs
NEC Research Institute, Princeton, NJ 08540, USA
We describe an algorithm- and representation-level theory of illusory contour shape and salience. Unlike previous theories, our model is derived from a single assumption: that the prior probability distribution of boundary completion shape can be modeled by a random walk in a lattice whose points are positions and orientations in the image plane (i.e., the space that one can reasonably assume is represented by neurons of the mammalian visual cortex). Our model does not employ numerical relaxation or other explicit minimization, but instead relies on the fact that the probability that a particle following a random walk will pass through a given position and orientation on a path joining two boundary fragments can be computed directly as the product of two vector-field convolutions. We show that for the random walk we define, the maximum likelihood paths are curves of least energy, that is, on average, random walks follow paths commonly assumed to model the shape of illusory contours. A computer model is demonstrated on numerous illusory contour stimuli from the literature.

1 Introduction

The completion shape and salience problem is the problem of computing the shape and relative likelihood (as determined by the prior distribution) of the family of curves that potentially connect (or complete) a set of boundary fragments. This is a necessary intermediate step in the solution of the full figural completion problem, which has been previously characterized (Williams & Hanson, 1996) as the problem of building a Huffman-labeled figure representing the boundaries of hidden and visible surfaces (see Figure 1).

The fundamental assumption underlying our work is that the paths followed by a particle undergoing a random walk in a lattice of discrete positions and orientations can be used to model the prior probability distribution of the shape of boundary completions (see also Mumford, 1994). We observe that the activity of a population of neurons with receptive fields tuned to discrete positions and orientations (like early areas of primate visual cortex) can be interpreted as a probability distribution describing a particle's possible location (i.e., the current state of the Markov process defining the random walk).

Neural Computation 9, 837–858 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: Williams and Hanson (1996) have characterized figural completion as the problem of building a Huffman-labeled figure from a set of boundary fragments. A Huffman-labeled figure consists of a set of closed, oriented plane curves (representing surface boundary components) together with an explicit indication of relative depth at crossing points. The labeling scheme is based on Huffman’s concept of “X-ray pictures” of smooth objects (Huffman, 1971). (Left) Input stimulus. (Middle) Boundary fragments (thick) and potential boundary completions (thin) represented by cubic Bezier splines of least energy. The stochastic completion field (introduced in this article) is the corresponding parallel distributed representation. (Right) The human percept, represented as a Huffman-labeled figure. The problem of deriving the corresponding parallel distributed representation is a subject for future research.
Let there exist two subsets of states, P and Q, representing the beginning and ending points of a set of boundary fragments. We call P the set of sources and Q the set of sinks. Our goal is to compute the probability that a particle, initially in state (xp, yp, θp) for some p ∈ P, will in the course of a random walk (representing a prior on completion shape) pass through state (u, v, φ) on its way to state (xq, yq, θq) for some q ∈ Q, for all combinations of u, v, and φ. We call this parallel distributed representation of the family of curves that potentially connect pairs of boundary fragments a stochastic completion field.

To appreciate better the scope of this article, it is useful to identify the phenomena for which we believe the stochastic completion field model can and cannot account. Our primary claim is that the mode, magnitude, and variance of the stochastic completion field are related to the shape, salience, and sharpness of perceived illusory contours. However, the definition of the stochastic completion field says nothing about the specific forms of brightness stimuli that can elicit illusory contours. In particular, we do not describe a comprehensive theory of how raw image brightnesses are transformed into sources and sinks for our diffusion process. Furthermore, the stochastic completion field says nothing about apparent brightness, which is distinct from salience.
Finally, we note that whether a completion is actually perceived cannot be solely a function of local configurational factors; it also depends on the role the completion plays in the larger surface organization (Williams & Hanson, 1996). Specifically, it depends on (1) whether the completion can be incorporated in a consistent way into a Huffman-labeled figure and (2) whether it is nominally visible (modal) or occluded (amodal). Neither of these can be determined by a process that does not consider the topology of the scene. Consequently, the stochastic completion field can be regarded only as a precursor to a true surface representation.

2 Previous Work

The novelty of our method lies in the definition of the stochastic completion field, which explicitly embodies the assumption that the prior distribution of completion shape can be modeled as a random walk. However, other researchers have advanced algorithm- and representation-level theories of figural completion based on orientation fields of various kinds. At the algorithm and representation level, all of these models are outwardly similar, because all derive from similar views of the orientation preference structure of the visual cortex, common assumptions about neural computation, and basic considerations (whether explicit or implicit) such as translational and rotational invariance.

The earliest theory of illusory contour shape is due to Ullman (1976). Ullman hypothesized that the curve used by the human visual system to join two contour fragments is constructed from two circular arcs; each arc is tangent to its sponsoring contour at one end and to the other arc at their point of intersection. From the family of possible curves of this form, the pair that minimizes the total bending energy, E = ∫ κ(s)² ds, where κ(s) is curvature parameterized by arc length, models the shape of the illusory contour. Ullman suggested that illusory contours could be computed in parallel in a network by means of local operations, but he did not implement or test this idea.

Grossberg and Mingolla (1985) describe a model of illusory contour formation that involves repeated convolution with a large-kernel filter. In outward form, this kernel resembles ours, but it does not represent the Green's function of a stochastic process of the sort described by us or by Mumford (1994). The outputs of oppositely oriented filters are combined through a nonlinear operation consisting of thresholding followed by a logical AND function. Grossberg and Mingolla's network is complex, and its convergence properties are difficult to analyze. Part of this complexity clearly stems from their desire to model, in a comprehensive way, the many different forms of stimulus that can elicit illusory contours (e.g., contrast edges of mixed sign, line endings). No other model (including ours) attempts to be this comprehensive.
Parent and Zucker (1989) describe a network algorithm for computing a consistent discrete tangent and curvature field from noisy local tangent and curvature measurements. Their network solves a well-defined relaxation labeling problem in which the goal is to find the most probable assignment of labels subject to a compatibility relation. The labels are discrete tangent and curvature values at every image point, and the compatibility relation is based on co-circularity. Although they do not claim to model illusory contour shape, David and Zucker (1990) use active minimum-energy-seeking contours to locate the valleys (or, equivalently, the mode or ridge lines) of a potential function derived from the output of the Parent and Zucker network. The ridge lines of the potential function represent the integral curves of the tangent field.

Like us, Heitger and von der Heydt (1993) describe a theory of figural completion based on nonlinear combination of the convolutions of "keypoint" images with a fixed set of oriented grouping filters. Significantly, they demonstrate their method both on illusory contour figures like the Kanizsa triangle and on more "realistic" images (e.g., of plants and rocks), with impressive results. Unfortunately, neither the equations defining the filters nor the proposed method of combination is described as a means of computing an underlying function, which makes their method difficult to analyze. Moreover, because their grouping filters are scalar functions, they cannot (even implicitly) model a prior probability distribution of completion shapes in the manner we describe.

Guy and Medioni (1996) have described a method for computing a vector-field representing global image structure from local tangent measurements. As in the Hough transform, the key to their approach is the local summation of a set of global voting patterns. Unlike the Hough transform, however, the accumulator is spatially registered with the image, and the voting pattern is a vector-field, not a scalar-field. Because the voting pattern represents orientations that are co-circular to the tangent measurements, the vector-field is nonstochastic (i.e., deterministic) and cannot model the prior distribution of completion shapes.

Shashua and Ullman (1988) describe a network algorithm for computing "perceptual saliency," although they do not portray it as a faithful model of human visual processing. Computing salience (as they define it) requires finding the energy of the minimum-energy curve of length n beginning at every position and orientation. This energy function combines a term similar to squared curvature and a term measuring gap size. However, because the energy function also includes terms designed to guarantee convergence, the output of their system can be difficult to anticipate. Alter and Basri (1996) analyze the behavior of this network in detail and point out some of its counterintuitive properties.

3 Fundamental Assumption
Like many other problems in vision, figural completion is ill posed; the visual system cannot know the precise shape of an object's boundary where it is hidden from view by another object. The best it can do in such a situation is to make an educated guess. It is conventionally assumed that this guess takes the form of a maximum likelihood estimate; that is, given a prior probability distribution of completion shapes, it is assumed that the visual system chooses the member of the distribution that is most probable. In this article, we propose a somewhat different approach. Instead of computing (just) the maximum likelihood completion shape, we propose that the visual system computes local image plane statistics of the distribution of possible completion shapes. There is more information contained in a distribution than just the location of its mode, and we believe that in the case of the distribution of boundary completion shapes, these other properties are psychophysically meaningful. In particular, we believe that the variance of the distribution about the mode is related to illusory contour sharpness.

Like Mumford (1994), we characterize the distribution of completion shapes using the mathematical device of a particle undergoing a stochastic motion (a directional random walk). The random walk embodies the Gestalt principles of proximity and good continuation, which we believe originate in the statistics of the environment in which our visual systems evolved. However, apart from shortness and smoothness (which have their origins in the world), we make one additional assumption, the origin of which (we believe) is in our heads: that the distribution of shapes that can extend a contour at any point is totally determined by the position and orientation of the contour at that point, that is, that the distribution of shapes can be modeled as a Markov process. An immediate consequence of this assumption is that for the random walk we define, the maximum likelihood path followed by a particle joining two points at fixed orientations can be shown (see the appendix) to be a curve of least energy, where energy is a linear combination of the integral of squared curvature and length. This is the curve that, others have theorized (Horn, 1981), models the shape of illusory contours joining boundary fragments whose orientation difference is significantly less than π/2. The more important consequence, however, is computational: the Markov assumption allows us to factor the stochastic completion field into source and sink fields, each of which can be computed by a linear transformation.

To summarize, it is certainly possible to define prior distributions for boundary completion shape using devices other than a random walk (e.g., co-circularity). Unfortunately, the resulting priors will not be Markov. Consequently, the maximum likelihood completion shapes will not be curves of least energy, and it will not be possible to decompose the completion field into a product of more elementary fields.
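For concreteness, the discrete counterpart of this least-energy functional is easy to write down. The sketch below is an illustrative approximation of ours (turning-angle curvature estimates; the weights alpha and beta are free parameters), not code from the article.

    import numpy as np

    def curve_energy(points, alpha=1.0, beta=1.0):
        # Discrete E = alpha * integral(kappa^2 ds) + beta * length,
        # for an (n, 2) array of samples along the curve (n >= 3).
        pts = np.asarray(points, dtype=float)
        seg = np.diff(pts, axis=0)                  # chords between samples
        ds = np.linalg.norm(seg, axis=1)            # chord lengths
        theta = np.arctan2(seg[:, 1], seg[:, 0])    # tangent directions
        dtheta = np.diff(theta)
        dtheta = (dtheta + np.pi) % (2 * np.pi) - np.pi  # wrap to (-pi, pi]
        mean_ds = 0.5 * (ds[:-1] + ds[1:])
        kappa = dtheta / mean_ds                    # turning angle per length
        return alpha * np.sum(kappa**2 * mean_ds) + beta * np.sum(ds)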
the change in a particle's position and orientation as a function of time; the decay constant, τ, specifies a particle's average lifetime.

As part of a study of the reduction of search made possible by prior grouping in visual object recognition, Jacobs (1989) computed the probability density of the size of boundary gaps due to occlusion in random juxtapositions of a set of flat polygonal surfaces. Jacobs concluded that small gaps predominate and that incident frequency rapidly drops off with increasing size. By including a decay mechanism in the stochastic process, we are able to model the component of the completion shape probability distribution dependent on length. Because a certain fraction of particles, $1 - e^{-1/\tau}$, decays per unit time, longer paths are exponentially less likely.

A particle's position and orientation are updated according to the following stochastic nonlinear differential equation:

$$\dot{x} = \cos\theta, \qquad \dot{y} = \sin\theta, \qquad \dot{\theta} = N(0, \sigma^2),$$

where $\dot{x}$ and $\dot{y}$ specify change in position, $\dot{\theta}$ is change in orientation (curvature), and N(0, σ²) is a normally distributed random variable with zero mean and variance equal to σ². There is a strong similarity between the equations of motion defined here and the Kalman filter equations Cox, Rehg, and Hingorani (1993) employed in their work on edge tracking; a particle moves with constant speed in a direction that is continually changing by some random amount. The effect is that particles tend to travel in straight lines, but over time, they drift to the left or right by an amount proportional to σ² (for example random walks, see Figure 2). When σ² = 0, the motion is completely deterministic, and particles never deviate from straight paths. When σ² = ∞, the motion is completely random, and the stochastic process becomes a two-dimensional isotropic random walk. For this reason, we will sometimes refer to the stochastic process we define as a diffusion process and the parameter, σ², as the diffusivity.
Figure 2: (Left) An example of a random walk. (Right) 1000 random walks.
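These equations of motion are straightforward to simulate directly. The following sketch (a minimal Python/NumPy illustration, not the authors' implementation; the step budget and parameter values are arbitrary choices) draws sample trajectories of the kind shown in Figure 2.

```python
import numpy as np

def random_walk(max_steps=200, sigma2=0.005, tau=100.0, rng=None):
    """One directional random walk: unit-speed motion whose orientation
    is perturbed at each step by N(0, sigma2); the particle's lifetime
    is exponentially distributed with mean tau (the decay constant)."""
    rng = rng or np.random.default_rng()
    x = y = theta = 0.0
    path = [(x, y)]
    lifetime = int(rng.exponential(tau))
    for _ in range(min(max_steps, lifetime)):
        theta += rng.normal(0.0, np.sqrt(sigma2))  # d(theta) ~ N(0, sigma^2)
        x += np.cos(theta)                         # dx/dt = cos(theta)
        y += np.sin(theta)                         # dy/dt = sin(theta)
        path.append((x, y))
    return np.array(path)

walks = [random_walk() for _ in range(1000)]       # cf. Figure 2 (right)
```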
4 Problem Formulation

The receptive fields of neurons in area V1 of visual cortex are retinotopically organized and narrowly tuned to stimuli of specific position and orientation (Hubel and Wiesel, 1962). Indeed, it has been suggested (see Blasdel & Obermayer, 1994; or Grossberg & Olson, 1994) that the orientation preference structure that characterizes V1 is a consequence of mapping the space of positions and orientations in the plane (R² × S¹) onto the two-dimensional surface of the cortex (R²) while maximizing locality. It follows that the activity of a population of neurons in V1 can be represented by a probability density function defined over R² × S¹ and that the network of interconnections can be represented by an order six tensor (R² × S¹) × (R² × S¹). Conservatively speaking, the function the network computes can be viewed as a tensor product, mapping an input probability density function to an output probability density function through a linear transformation.

Let the order six tensor G represent the transition probabilities of a Markov process defined on the three-dimensional state space, R² × S¹, and satisfying the equations of motion defined above (G is the Green's function).¹ It follows that G(u, v, φ, x, y, θ; t) is the probability that a random walk of length t will end in state (u, v, φ) given that it began in state (x, y, θ) (see Figure 3). If p(x, y, θ; 0) is the probability density function describing the position and orientation of a particle at the beginning of its random walk, then the probability that the particle will be at some position and orientation at time t is

$$p(u, v, \phi; t) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G(u, v, \phi, x, y, \theta; t)\, p(x, y, \theta; 0) \cdot e^{-t/\tau}.$$
Because the probability of a random walk of a given shape is independent of its initial position and orientation in the plane, the order six tensor, G, consists entirely of translated and rotated copies of an order three tensor
1. Carman (1990) has emphasized the usefulness of Green's functions as descriptions of visual computations, and Carman and Welch (1992, 1993) have specifically proposed that they play a role in the computation of illusory surfaces.
Figure 3: (Upper left) P.d.f. representing source distribution, p(x, y, θ; 0), consists of an impulse located at (0, 0, 0). Although sometimes truncated for clarity, in general, the length of the arrows is proportional to the logarithm of the probability that a particle is located at that position and orientation. (Upper right) Snapshot of diffusion process at t = 15, that is, p(x, y, θ; 15). This is the result of convolving G(u′, v′, φ′; 15) with the p.d.f. representing t = 0. (Lower left and right) Diffusion process at t = 31 and t = 47, respectively.
representing the transition probabilities for random walks beginning at (0, 0, 0):

$$G(u, v, \phi, x, y, \theta; t) = G(u', v', \phi', 0, 0, 0; t),$$

where (u′, v′, φ′) is (u, v, φ) in the coordinate system defined by (x, y, θ), so that u′ = (u − x) cos θ + (v − y) sin θ, v′ = −(u − x) sin θ + (v − y) cos θ,
and φ′ = φ − θ. Henceforward, we will write G(u′, v′, φ′; t) instead of G(u′, v′, φ′, 0, 0, 0; t). The upshot is that by taking advantage of the translational and rotational symmetries of G, the tensor product can be computed by an operation similar to convolution:

$$p(u, v, \phi; t) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G(u', v', \phi'; t)\, p(x, y, \theta; 0) \cdot e^{-t/\tau}.$$
We will show that the stochastic completion field can be expressed as the product of a stochastic source field and a stochastic sink field. We define the stochastic source field, p′(u, v, φ), to be the fraction of paths that begin in a source state and pass through (u, v, φ) before they decay. If we assume that paths do not self-intersect before they decay, then the fraction of paths that pass through a given position and orientation equals the integral of the position and orientation probability density function over time:

$$p'(u, v, \phi) = \int_{0}^{\infty} dt\, p(u, v, \phi; t).$$
By changing the order of integration and defining a new Green's function, G′,

$$G'(u, v, \phi) = \int_{0}^{\infty} dt\, G(u, v, \phi; t) \cdot e^{-t/\tau},$$
the explicit time variable, t, can be suppressed. The result is that the stochastic source field, p′(u, v, φ), can be computed by convolving the probability density function representing the source distribution with the new Green's function (see Figure 4):

$$p'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(u', v', \phi')\, p(x, y, \theta; 0).$$
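To make the structure of this "convolution" concrete, here is a schematic Python rendering of the source-field computation for a sparse set of impulse sources. It assumes (our choice, not the article's implementation) that G′ has been precomputed as an array G_prime[iu, iv, k] sampled on the same grid, centered at index (size//2, size//2), and that nearest-neighbor binning of (u′, v′, φ′) is acceptable.

```python
import numpy as np

def source_field(G_prime, sources, size, m):
    """p'(u, v, phi): superposition of translated and rotated copies of
    G' placed at each impulse source (x, y, theta, weight)."""
    p = np.zeros((size, size, m))
    c = size // 2
    dphi = 2.0 * np.pi / m
    us, vs = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    for (x, y, theta, w) in sources:
        # (u', v') = (u - x, v - y) rotated by -theta into the source frame.
        up = (us - x) * np.cos(theta) + (vs - y) * np.sin(theta)
        vp = -(us - x) * np.sin(theta) + (vs - y) * np.cos(theta)
        iu = np.round(up).astype(int) + c
        iv = np.round(vp).astype(int) + c
        ok = (iu >= 0) & (iu < size) & (iv >= 0) & (iv < size)
        dk = int(round(theta / dphi))        # orientation shift: phi' = phi - theta
        for k in range(m):
            p[:, :, k][ok] += w * G_prime[iu[ok], iv[ok], (k - dk) % m]
    return p
```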
Observe that the probability² that a particle will pass through state (u, v, φ) on its way from a source state, p ∈ P, to a sink state, q ∈ Q, before it decays, is proportional to the product of (1) the probability that a particle beginning in a source will reach (u, v, φ) before it decays (the source field) and

2. Strictly speaking, what we are able to compute is a relative likelihood, not a probability. This is because we do not compute the absolute probability of a particle diffusing from a source to a sink by any path at all, which would represent the normalizing constant. So although it is possible to convert these relative likelihoods to probabilities, in practice this is not necessary, because relative likelihoods suffice for the purpose of comparing competing paths.
Figure 4: Stochastic source fields, p′(u, v, φ), representing the probability that a particle leaving (0, 0, 0) will reach (u, v, φ) before it decays. Basically, these are plots of the Green's function G′ (the time integral of G), for a range of diffusivities and decay constants. (Upper left) σ² = 0.05, τ = 100. (Upper right) σ² = 0.05, τ = 20. (Lower left) σ² = 0.01, τ = 100. (Lower right) σ² = 0.01, τ = 20.
(2) the probability that a particle beginning at (u, v, φ) will reach a sink before it decays (the sink field). This is an immediate consequence of the Markov property of the stochastic process. We have already described how the source field is computed. We now show that the sink field can be computed in an analogous manner. The magnitude of the sink field at (u, v, φ) is the product of the probability that a particle beginning at (u, v, φ) will reach (x, y, θ) before it decays (i.e., G′(x, y, θ, u, v, φ)) and the probability that a sink exists at (x, y, θ) (i.e.,
q(x, y, θ; 0)), integrated over all (x, y, θ):

$$q'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(x, y, \theta, u, v, \phi)\, q(x, y, \theta; 0).$$
Like G, the order six tensor G′ consists entirely of translated and rotated copies of an order three tensor representing the transition probabilities for random walks beginning at (0, 0, 0):

$$G'(x, y, \theta, u, v, \phi) = G'(x', y', \theta', 0, 0, 0),$$

where (x′, y′, θ′) is (x, y, θ) in the coordinate system defined by (u, v, φ), so that x′ = (x − u) cos φ + (y − v) sin φ, y′ = −(x − u) sin φ + (y − v) cos φ, and θ′ = θ − φ. Consequently (like the source field), the sink field can be computed by a kind of convolution:

$$q'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(x', y', \theta')\, q(x, y, \theta; 0).$$
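A matching sketch for the sink field, under the same assumptions as the source-field sketch above; the completion field is then the pointwise product of the two fields, as defined next.

```python
import numpy as np

def sink_field(G_prime, sinks, size, m):
    """q'(u, v, phi): probability of reaching some sink (x, y, theta, weight)
    from each cell; (x', y', theta') is (x, y, theta) expressed in the
    frame defined by (u, v, phi)."""
    q = np.zeros((size, size, m))
    c = size // 2
    dphi = 2.0 * np.pi / m
    us, vs = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    for (x, y, theta, w) in sinks:
        for k in range(m):
            phi = k * dphi
            xp = (x - us) * np.cos(phi) + (y - vs) * np.sin(phi)
            yp = -(x - us) * np.sin(phi) + (y - vs) * np.cos(phi)
            iu = np.round(xp).astype(int) + c
            iv = np.round(yp).astype(int) + c
            ok = (iu >= 0) & (iu < size) & (iv >= 0) & (iv < size)
            kp = (int(round(theta / dphi)) - k) % m   # theta' = theta - phi
            q[:, :, k][ok] += w * G_prime[iu[ok], iv[ok], kp]
    return q

# Completion field: pointwise product of source and sink fields.
# C = source_field(Gp, sources, size, m) * sink_field(Gp, sinks, size, m)
```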
Finally, the stochastic completion field, C(u, v, φ), which represents the relative likelihood that a particle leaving a source state will pass through (u, v, φ) and enter a sink state before it decays, equals the product of the source and sink fields (see Figure 5):

$$C(u, v, \phi) = p'(u, v, \phi) \cdot q'(u, v, \phi).$$

At this point, it is natural to ask how the computation we have just outlined might be implemented. In particular, we wish to identify the neural loci of the representations that form the basis of our model (the source, sink, and completion fields). To begin, we note that it would be somewhat premature for us to identify the source and sink fields with a specific population in V1 because many neural populations in V1 satisfy the basic requirements of having retinotopically organized receptive fields tuned to a specific position and orientation. However, with regard to the completion field, we can be somewhat more definite and identify its locus as the neurons in V2 described by von der Heydt, Peterhans, and Baumgartner (1984). von der Heydt et al. report that the firing rate of these neurons increases when their "receptive fields" are crossed by illusory contours (of specific orientations) induced by pairs of flanking elements. Significantly, these neurons do not respond to these same elements when presented in isolation; they respond only to pairs. It was this discovery that inspired the simple feedforward neural network model proposed by Peterhans, von der Heydt, and Baumgartner (1986). At the implementation level, our model is very similar. However, while Peterhans et al. do not describe the pattern of interconnectivity between the first and second layers of their network in any detail, this pattern is defined
Figure 5: (Top left) Source and sink distribution p(x, y, θ; 0) = q(x, y, θ; 0). The p.d.f. consists of four impulses equally spaced around the circumference of a circle. (Top right) The stochastic source field, p′(u, v, φ), represents the probability that a particle leaving a source will reach (u, v, φ) before it decays. (Bottom left) The stochastic sink field, q′(u, v, φ), represents the probability that a particle will leave state (u, v, φ) and enter a sink state before it decays. (Bottom right) The stochastic completion field, C(u, v, φ), is the product of source and sink fields.
in our model by the Green's function, G′ (see Figure 6). Consequently, our neural network computes a stochastic completion field. Although it is possible to solve the stochastic nonlinear differential equation directly and, in doing so, find an analytic equation for the Green's function (see Thornber & Williams, 1996), we initially solved for G′ by a Monte Carlo method. All of the experimental results in this article are based
Figure 6: The stochastic completion field can be computed by a two-layer feedforward neural network. The source field is computed by the left half of the network and the sink field by the right half. The Green's function, G′, defines the network of interconnections between the first and second layers. The third layer computes the product of the source and sink fields and can be tentatively identified with the neurons in V2 described by von der Heydt, Peterhans, and Baumgartner (1984).
on an approximation of G′ computed by direct simulation of the random walk for 1.0 × 10⁶ trials on a 256 × 256 grid with 36 fixed orientations.³ The particle trajectories are modeled as chains of line segments with real-valued end points and with interior angles drawn from a continuous gaussian distribution. The probability that a particle beginning at (0, 0, 0) will reach (u, v, φ) before it decays (i.e., G′(u, v, φ)) is approximated by the fraction of simulated trajectories beginning at (0, 0, 0) that intersect the region (u ± 1.0, v ± 1.0, φ ± π/72).

3. In a computer vision application, this would be a compile-time cost, not a run-time cost.
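A minimal sketch of such a Monte Carlo estimate follows (illustrative trial count and parameters; the binning is simplified to one grid cell per visit, whereas the article bins into (u ± 1.0, v ± 1.0, φ ± π/72) and counts trajectories rather than visits).

```python
import numpy as np

def estimate_G_prime(trials=10**5, size=256, m=36, sigma2=0.005, tau=20.0,
                     seed=0):
    """Approximate G'(u, v, phi) by direct simulation: the fraction of
    walks from (0, 0, 0) observed in each (u, v, phi) cell before decay."""
    rng = np.random.default_rng(seed)
    G = np.zeros((size, size, m))
    c = size // 2
    for _ in range(trials):
        x = y = theta = 0.0
        steps = int(rng.exponential(tau))      # lifetime before decay
        for _ in range(steps):
            theta += rng.normal(0.0, np.sqrt(sigma2))
            x += np.cos(theta)
            y += np.sin(theta)
            iu, iv = int(round(x)) + c, int(round(y)) + c
            if not (0 <= iu < size and 0 <= iv < size):
                break
            k = int(round(theta / (2 * np.pi / m))) % m
            G[iu, iv, k] += 1.0                # counts visits (simplification)
    return G / trials
```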
Figure 7: Example completion fields. (Left) p1 = (−40, 0, 30°) and q1 = (40, 0, −30°). (Right) p2 = (−40, 0, 30°) and q2 = (40, 0, 30°). The left source and sink pair is scaled by 1.0 × 10⁷ and the right by 1.0 × 10⁸.
Because the end points of these segments are not confined to the integer lattice, the stochastic completion fields we compute are discrete subsamplings of continuous fields, not discrete fields computed by an approximation to the continuous process.

5 Experimental Results

As a first demonstration, let us consider two source-sink pairs. The source and sink distributions were represented by arrays of size 128 × 128 × 36 and consisted of a single oriented impulse (a single nonzero value). In the first case, source and sink are positioned on a horizontal axis and are oriented symmetrically about this axis; in the second, they possess the same orientation, so that the curves of least energy joining them will contain an inflection point. The diffusivity, σ², equaled 0.005, and the decay constant, τ, equaled 20. Figure 7 depicts the stochastic completion field for these source and sink pairs as brightness images where brightness encodes the sum over all 36 orientations. Because the displayed brightnesses for each pair are scaled to take maximum advantage of the limited number of gray levels (i.e., 255), it is not obvious that there is an order of magnitude difference in saliency between the first and second pair. The first source and sink pair is scaled by 1.0 × 10⁷, the second by 1.0 × 10⁸.

It is sometimes assumed that illusory contours cannot contain inflection points. However, we do not believe that there is an all-or-nothing rule that prohibits inflection points (Kellman & Shipley, 1991). Rather, we believe that saliency is a continuum, and illusory contours containing inflection points might simply be an order of magnitude weaker on average.

In a second demonstration, we consider a source distribution
Figure 8: (Left) Ehrenstein figure. (Right) Stochastic completion field for Ehrenstein figure.
consisting of four oriented impulses equally spaced around the circumference of a circle (see Figure 5, top left). This distribution is meant to represent an Ehrenstein figure (see Figure 8, left). The four impulses are located at the end points of the four line segments comprising the figure and possess orientation normal to the segments. Figure 8 (right) shows the stochastic completion field, where brightness encodes the sum over all orientations. The majority of the brightness is confined to a narrow region surrounding an approximately circular ridge line. This is consistent with the shape most often reported by human observers. Interestingly, some observers report seeing a diamond shape, not a circle. We note that a diamond-shaped completion field would be produced if the diffusivity (σ²) were very large and the particle half-life (τ) were very small.

We have also demonstrated our implementation on several well-known illusory contour figures from the visual psychology literature. Before doing this, we needed to find a principled way of translating an image into a set of sources and sinks for the diffusion process. In general, this requires a sophisticated analysis of the local image brightness structure to identify L-junctions, T-junctions, Y-junctions, and X-junctions formed by both contrast and outline edges. Classifying and measuring the multiple orientations at so-called key points (Heitger & von der Heydt, 1993) is a difficult research problem in its own right and the subject of much current research (Simoncelli & Farid, 1996; Freeman & Adelson, 1991; Michaelis & Sommer, 1994; Perona, 1992). For the moment, we use a steerable one-sided filter scheme similar to that of Simoncelli and Farid (1996) to identify orientation discontinuities (i.e., corners) formed by contrast edges and to measure the two
Figure 9: (Top left) A corner of a “pacman.” (Top right) Sources (thin) and sinks (thick) computed by steerable one-sided filter (multiple orientations are due to anti-aliasing). (Bottom left) Kanizsa triangle. (Bottom right) Stochastic completion field summed over all orientations and superimposed on brightness gradient magnitude image. Both the illusory triangle and the three discs are completed.
orientations with precision sufficient for our purposes. The maximum and minimum of the continuous response of the steerable one-sided filter are first measured to approximately 3 degrees of accuracy. As an anti-aliasing measure, a unit mass is distributed proportionally among the two discrete orientations (10 degrees apart) straddling the nominal maximum and minimum. The maximum is interpreted as a source and the minimum as a sink (see Figure 9, top left and right). Generalizing this scheme to classify and measure the wide range of events corresponding to likely points of boundary
occlusion is the only obstacle to demonstrating our work on more realistic images.

Figure 9 (bottom left) shows the Kanizsa triangle stimulus. Key points are located at a positive maximum of curvature. Figure 9 (bottom right) shows the stochastic completion field summed over all orientations and superimposed on the brightness gradient magnitude image. Both the illusory triangle and the three discs are completed. In contrast with the results of Heitger and von der Heydt (1993), no nonmaximum suppression needs to be employed.

Figure 10 (top) shows the Kanizsa "paisley" stimulus. Key points are located at negative minima of curvature. Figure 10 (bottom) shows the stochastic completion field summed over all orientations and superimposed on the brightness gradient magnitude image.

Figure 11 (top) shows a complex illusory contour figure designed by Kanizsa (1979). Key points are located at positive maxima of curvature. It is useful to compare the stochastic completion field computed for this display to the middle panel of Figure 1. Because our diffusion process has no knowledge of the topology of surfaces, the stochastic completion field, shown in Figure 11 (bottom left), contains potential completions that are not perceived by human subjects. Potential completions required to complete the four rectangles are among the most salient, however. Figure 11 (bottom right) shows the logarithm of the stochastic completion field summed over all orientations. In the logarithm image, many additional completions of significantly lower average likelihood become visible. Included among these are those required to complete the four black discs and eight black squares perceived by human subjects.
6 Conclusion

It is widely acknowledged that perceptual organization (segmentation, grouping) is among the most difficult problems facing researchers in computer vision today. Current structure from motion algorithms produce estimates of depth only for isolated points, yet true surface representations are required to compute a stable grasp or to plan an unobstructed path through free space. Current object recognition systems do not work on large model bases, nor do they function robustly in the presence of occlusion. Our work advances the state of the art in perceptual organization by providing a plausible model of illusory contour formation based on a diffusion process.

We have shown how curves of least energy interpolating a set of edge fragments can be computed using biologically plausible algorithms and representations. Significantly, this is accomplished without using numerical relaxation or other explicit minimization but instead relies on the fact that the probability that a particle following a random walk will pass through a particular position and orientation on a path joining two image
Figure 10: (Top) Kanizsa’s “paisleys.” (Bottom) Stochastic completion field integrated over all orientations and superimposed on brightness gradient magnitude image. It is interesting to note that the average likelihoods of the shorter completions (perceived modally) are several orders of magnitude greater than the average likelihoods of the longer completions (perceived amodally).
Figure 11: (Top) A complex figure (Kanizsa, 1979). (Bottom left) Stochastic completion field integrated over all orientations and superimposed on brightness gradient magnitude image. (Bottom right) Logarithm of stochastic completion field integrated over all orientations.
measurements can be computed directly as the product of two vector-field convolutions.

The most closely related work is by Heitger and von der Heydt (1993), Guy and Medioni (1996), and Shashua and Ullman (1988). Although we owe an intellectual debt to these three articles, none proceeds cleanly from a definition of what they want to compute (a function) to a method of computing it (an algorithm). Our main contribution is to show how outwardly similar algorithms and representations follow immediately from a clearly stated assumption: that the prior probability distribution of boundary completion shapes can be modeled by a random walk.
Appendix

In this appendix, we demonstrate the relationship between the diffusion process we have described and so-called curves of least energy. To accomplish this, we employ a discrete-time approximation to the continuous equations of motion. For the discrete-time random walk, we show that the most likely path between a source and a sink is the curve of least energy connecting them. This tells us that, in the limit, as the time-step size becomes small, the most likely path between a source and a sink converges to the continuous curve of least energy. From this, we conclude that our model of diffusion concisely encodes the prior assumption that the most likely shape of an occluded section of an object's boundary is the curve of least energy.

Suppose that a particle begins its random walk at a source p and follows a trajectory Γ, which consists of n unit length steps, along with n changes in angle denoted by κ₁, . . . , κₙ, ending at sink q. That is, its trajectory is an n-sided polygonal arc, comprising unit length segments and with exterior angles denoted by κᵢ. From the definition of the random walk, we know that the density function on the set of paths that such a particle may take is given by

$$g(\Gamma_{pq}) = \frac{1}{\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p} \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-1/\tau}\, e^{-\kappa_i^2 / 2\sigma^2},$$

where $\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p$ indicates integration of the density function on the set of paths beginning in source p (i.e., f(Γₚ)) over all paths ending in sink q (see Ash, 1972, for a discussion of conditional probabilities on continuous functions). Taking the logarithm of both sides (and because $\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p$ is constant for given p and q),

$$\log(g(\Gamma_{pq})) + C = n\left(-\frac{1}{\tau} - \log(\sigma\sqrt{2\pi})\right) - \sum_{i=1}^{n} \frac{\kappa_i^2}{2\sigma^2}. \tag{A.1}$$
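Spelling out the intermediate step (here C absorbs the constant path-integral normalization, $C = \log \int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p$):

$$\log g(\Gamma_{pq})
  = -\log\!\int_{\Gamma_q} f(\Gamma_p)\, d\Gamma_p
    + \sum_{i=1}^{n}\left(-\frac{1}{\tau}
      - \log(\sigma\sqrt{2\pi})
      - \frac{\kappa_i^2}{2\sigma^2}\right)
  = -C + n\left(-\frac{1}{\tau} - \log(\sigma\sqrt{2\pi})\right)
    - \sum_{i=1}^{n}\frac{\kappa_i^2}{2\sigma^2}.$$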
We now relate this expression to the energy of a continuous curve. This energy is defined as

$$\alpha \int_{\Gamma} \kappa(t)^2\, dt + \beta \int_{\Gamma} dt,$$

where κ(t) is the curvature at Γ(t) and α and β are constants that weight the cost for the length relative to the cost for the total squared curvature. Although the energy of any curve with a curvature discontinuity is infinite, it is natural to consider an analogous quantity for polygonal arcs based on the number of segments (n) and the sum of the squared exterior angles $\left(\sum_{i=1}^{n} \kappa_i^2\right)$. This is precisely the negative of the right side of equation (A.1), with α = 1/2σ² and
β = 1/τ + log(σ√2π). This shows that the log likelihood that a random walk will follow any given polygonal path is linearly related to the energy of that path for a natural discrete formulation of energy. Therefore, the maximum likelihood polygonal path between two points is the discrete curve of least energy. Because the discrete-time random walk converges to a continuous-time random walk as the time-step size, velocity, and diffusivity approach zero, it follows that the continuous equations of motion concisely encode our prior assumption concerning the probability distribution of completion shape.

Acknowledgments

We gratefully acknowledge many helpful conversations with Karvel Thornber. Thanks also to Ingemar Cox and Bill Bialek.

References

Alter, T., & Basri, R. (1996). Extracting salient contours from images: An analysis of the saliency network. Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, San Francisco, pp. 13–20.
Ash, R. (1972). Real analysis and probability. San Diego: Academic Press.
Blasdel, G., & Obermayer, K. (1994). Putative strategies of scene segmentation in monkey visual cortex. Neural Networks, 7(6–7), 865–881.
Carman, G. J. (1990). Mappings of the cerebral cortex. Unpublished doctoral dissertation, California Institute of Technology, Pasadena, CA.
Carman, G. J., & Welch, L. (1992). Three-dimensional illusory contours and surfaces. Nature, 360, 585–587.
Carman, G. J., & Welch, L. (1993). Position, orientation and shape of real and illusory three-dimensional surfaces. Investigative Ophthalmology and Visual Science (ARVO), 34(4), 1131.
Cox, I. J., Rehg, J. M., & Hingorani, S. (1993). A Bayesian multiple hypothesis approach to edge grouping and contour segmentation. International Journal of Computer Vision, 11, 5–24.
David, C., & Zucker, S. W. (1990). Potentials, valleys and dynamic global coverings. International Journal of Computer Vision, 5(3), 219–238.
Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters for image analysis, enhancement and wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 891–906.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173–211.
Grossberg, S., & Olson, S. J. (1994). Rules for the cortical map of ocular dominance and orientation columns. Neural Networks, 7(6–7), 883–894.
Guy, G., & Medioni, G. (1996). Inferring global perceptual contours from local features. International Journal of Computer Vision, 20, 113–133.
Heitger, F., & von der Heydt, R. (1993). A computational model of neural contour processing: Figure-ground segregation and illusory contours. Proc. of 4th Intl. Conf. on Computer Vision, Berlin, Germany.
Horn, B. K. P. (1981). The curve of least energy (MIT AI Lab Memo No. 612). Cambridge, MA: MIT.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Huffman, D. A. (1971). Impossible objects as nonsense sentences. Machine Intelligence, 6, 295–323.
Jacobs, D. (1989). Grouping for recognition (MIT AI Lab Memo No. 1177). Cambridge, MA: MIT.
Kanizsa, G. (1979). Organization in vision. New York: Praeger.
Kellman, P. J., & Shipley, T. F. (1991). A theory of visual interpolation in object perception. Cognitive Psychology, 23, 141–221.
Michaelis, M., & Sommer, G. (1994). Junction classification by multiple orientation detection. Proc. of European Conf. on Computer Vision, pp. 101–108.
Mumford, D. (1994). Elastica and computer vision. In C. Bajaj (Ed.), Algebraic geometry and its applications. New York: Springer-Verlag.
Parent, P., & Zucker, S. W. (1989). Trace inference, curvature consistency and curve detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 823–839.
Perona, P. (1992). Steerable, scalable kernels for edge detection and junction analysis. Proc. of European Conf. on Computer Vision, pp. 3–18.
Peterhans, E., von der Heydt, R., & Baumgartner, G. (1986). Neuronal responses to illusory contour stimuli reveal stages of visual cortical processing. In J. Pettigrew, K. Sanderson, & W. Levick (Eds.), Visual neuroscience. Cambridge: Cambridge University Press.
Shashua, A., & Ullman, S. (1988). Structural saliency: The detection of globally salient structures using a locally connected network. 2nd Intl. Conf. on Computer Vision, Clearwater, FL, pp. 321–327.
Simoncelli, E. P., & Farid, H. (1996). Steerable wedge filters for local orientation analysis. IEEE Transactions on Image Processing, 5(9), 1377–1382.
Thornber, K. K., & Williams, L. R. (1996). Analytic solution of stochastic completion fields. Biological Cybernetics, 75, 141–151.
Ullman, S. (1976). Filling in the gaps: The shape of subjective contours and a model for their generation. Biological Cybernetics, 21, 1–6.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262.
Williams, L. R., & Hanson, A. R. (1996). Perceptual completion of occluded surfaces. Computer Vision and Image Understanding, 64(1), 1–20.
Received May 5, 1995; accepted July 12, 1996.
Communicated by Steven Zucker
Local Parallel Computation of Stochastic Completion Fields

Lance R. Williams
David W. Jacobs
NEC Research Institute, Princeton, NJ 08540, USA
We describe a local parallel method for computing the stochastic completion field introduced in the previous article (Williams and Jacobs, 1997). The stochastic completion field represents the likelihood that a completion joining two contour fragments passes through any given position and orientation in the image plane. It is based on the assumption that the prior probability distribution of completion shape can be modeled as a random walk in a lattice of discrete positions and orientations. The local parallel method can be interpreted as a stable finite difference scheme for solving the underlying Fokker-Planck equation identified by Mumford (1994). The resulting algorithm is significantly faster than the previously employed method, which relied on convolution with large-kernel filters computed by Monte Carlo simulation. The complexity of the new method is O(n³m), while that of the previous algorithm was O(n⁴m²) (for an n × n image with m discrete orientations). Perhaps most significant, the use of a local method allows us to model the probability distribution of completion shape using stochastic processes that are neither homogeneous nor isotropic. For example, it is possible to modulate particle decay rate by a directional function of local image brightnesses (i.e., anisotropic decay). The effect is that illusory contours can be made to respect the local image brightness structure. Finally, we note that the new method is more plausible as a neural model since (1) unlike the previous method, it can be computed in a sparse, locally connected network, and (2) the network dynamics are consistent with psychophysical measurements of the time course of illusory contour formation.

1 Introduction

In recent years, several different researchers (Grossberg & Mingolla, 1985; Guy & Medioni, 1996; Heitger & von der Heydt, 1993; Williams & Jacobs, 1997) have formulated computational models of perceptual completion and illusory contours based on large-kernel convolution.¹ Although the function computed in each case is different, each method requires computing

1. See "Stochastic Completion Fields" elsewhere in this journal for a detailed literature review.
Neural Computation 9, 859–881 (1997)
© 1997 Massachusetts Institute of Technology
one or more convolutions with filters represented by kernels of size equal to the largest gaps to be bridged (usually taken to be the size of the image). From the standpoint of computer vision, it should be clear that computing large numbers of convolutions with kernels of size equal to the image size is prohibitively expensive. These methods also have deficiencies as models of human visual processing, since implementation in vivo would require pairs of neurons separated by large amounts of visual arc to be densely interconnected. Although this possibility cannot be ruled out,² a model based on local connections clearly makes weaker demands on neuroanatomy (see Figure 1).

Data from human psychophysics are also inconsistent with methods requiring large-kernel convolution. First, experiments by Rock and Anson (1979) (and by Ramachandran, Ruskin, Cobb, Rogers-Ramachandran, & Tyler, 1994) demonstrate (see Figure 2) that illusory contours can be suppressed by the presence of texture or other figural elements along the path the completion would follow if these elements were absent. That is, the shape, salience, and sharpness of the completion are not solely a function of characteristics of the pair of elements it bridges but are also a function of the intervening pattern of image brightnesses. Second, recent studies of the time course of illusory contour formation (Rubin, Shapley, & Nakayama, 1995) show that the time required for an illusory contour to form is at least partly a function of the size of the gap to be bridged.³ Again, this is consistent with a local-iterative process, since a global process would require an amount of time independent of the size of the gap to be bridged.

In the previous article (Williams and Jacobs, 1997), we presented a theory of the shape, salience, and sharpness of illusory contours and other perceptual completions. This theory was based on the assumption that the visual system computes the probability that an object boundary (possibly occluded) exists at each position and orientation in the visual field. This representation was termed a stochastic completion field. The inputs to this computation are measurements derived from the image about locations where completions might begin and end and a model for the probability distribution of completion shapes. Like Mumford (1994), we proposed that the prior probability distribution of completion shapes can be modeled by
2. Absence of evidence is not evidence of absence.
3. This result is due to Rubin, Shapley, and Nakayama (1995), who used a forced-choice discrimination task together with a masking stimulus to estimate the time required for illusory contour formation. They first showed that the time required for an illusory contour to form increases in direct proportion to distance when the "pacmen" in a Kanizsa display are "pulled apart." Somewhat surprisingly, they then showed that the time required for an illusory contour to form remains approximately constant when the entire figure is uniformly scaled. This somewhat paradoxical result is consistent with a stochastic completion field "pyramid" computed by repeated small-kernel convolution.
Figure 1: Two possible neural networks for computing stochastic completion fields. (a) Convolution with large-kernel filters requires that neurons separated by large amounts of visual arc be densely interconnected. (b) Repeated smallkernel convolution makes weaker demands on neuroanatomy since it can be implemented in a network with only local connections.
Figure 2: Rock and Anson (1979) show that illusory contours in the Kanizsa triangle can be suppressed by the presence of a background texture. (a) Kanizsa triangle. (b) Kanizsa triangle with background texture.
a particle undergoing a stochastic motion (i.e., a directional random walk).⁴ Unlike the familiar two-dimensional isotropic random walk, where a particle's state is simply its position in the plane, the particles of the directional random walk possess both position and orientation. The particle moves in the plane with constant speed in the direction specified by its orientation. Change in orientation ($\dot{\theta}$) is a normally distributed random variable with zero mean and variance equal to σ², so that the particle's orientation (θ) is a Brownian motion. The result is that particles tend to travel in straight lines but deviate from straightness by an amount that depends on σ². This reflects a prior expectation of smoothness. In addition, a constant fraction of particles, $1 - e^{-1/\tau}$, decays per unit time, where τ is the half-life of the particle, and this reflects a prior expectation of shortness.

If there exist two subsets of measurements, P and Q (representing the beginning and ending points of a set of boundary fragments), then the stochastic completion field, C(u, v, φ), is defined as the probability that a particle, initially at (xₚ, yₚ, θₚ), for some p ∈ P, will, in the course of a
4. The important thing is not the particles themselves but the distribution of occluded shapes that the device of stochastic particle motion is intended to characterize. A particle follows a trajectory that represents one of the many possible shapes that an occluded object's boundary might have in the region where it is not directly visible. The probability that a particle follows a path of a particular shape is presumed to equal the frequency with which that shape occurs in the world.
directional random walk, pass through (u, v, φ), on its way to (x_q, y_q, θ_q), for some q ∈ Q. This definition emphasizes the input and output of the computation that the visual system is performing (the function). It describes what information from the image and generic knowledge of the world should be combined to estimate the shape of occluded object boundaries. Using Marr's terminology (Marr, 1982), this is a computational theory-level model.

We feel that our work demonstrates the value of Marr's approach for understanding human vision. This approach distinguishes between theories at three levels: (1) computational theories, (2) algorithm- and representation-level theories, and (3) implementation-level theories. Identical functions might be computable by different algorithms, and identical algorithms can be implemented (in the brain) in different ways. Many previous theories of figural completion have been specified only at the level of algorithm and representation. These models use an abstraction of biological neural networks as a kind of programming language. Unfortunately, it is difficult to understand exactly what problem these programs solve and what assumptions they make. Consequently, the activity of units in these models and the computations they perform are often impossible to interpret. In contrast, our computational theory has allowed us to formulate algorithms based on operations that are meaningful for distributions of contours.⁵

In this article, we show how the stochastic completion field can be computed efficiently in a local parallel network. The network computation is based on a finite difference scheme for solving the Fokker-Planck equation identified by Mumford (1994). This algorithm avoids computationally prohibitive large-kernel convolutions and is more consistent with known psychophysics and neuroanatomy.

2 Review of Previous Method

Because the random walk is defined by a Markov process, C(u, v, φ) is proportional to the product of the probability that a particle beginning in a source will reach (u, v, φ) before it decays (the source field) and the probability that a particle beginning at (u, v, φ) will reach a sink before it decays (the sink field). Furthermore, because the Markov process is translation and rotation invariant, the probability that a particle will reach (u, v, φ) can be computed by "convolving" the source distribution, p(x, y, θ; 0), with the Green's function representing the probability that a particle at (0, 0, 0) at time zero will reach (u′, v′, φ′):

$$p'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(u', v', \phi')\, p(x, y, \theta; 0),$$
5. It has also made obvious the relationship between the output of our algorithms and curves of least energy (see Mumford, 1994; "Stochastic Completion Fields," elsewhere in this journal). This relationship is totally opaque in previous models.
where (u′, v′, φ′) is (u, v, φ) rotated by −θ about (x, y), so that u′ = (u − x) cos θ + (v − y) sin θ, v′ = −(u − x) sin θ + (v − y) cos θ, and φ′ = φ − θ. The Green's function, G′, was originally computed by a Monte Carlo method that involved simulating the random walks of 1.0 × 10⁶ particles on a 256 × 256 grid with 36 fixed orientations. The probability that a particle beginning at (0, 0, 0) reaches (x, y, θ) before it decays was approximated by the fraction of simulated trajectories beginning at (0, 0, 0) that intersect the region (x ± 1.0, y ± 1.0, θ ± π/72). The sink field, q′(u, v, φ), which represents the probability that a particle leaving (u, v, φ) will reach a sink state before it decays, can be computed in a similar fashion:

$$q'(u, v, \phi) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\pi}^{\pi} d\theta\, G'(x', y', \theta')\, q(x, y, \theta; 0),$$
where (x′, y′, θ′) is (x, y, θ) rotated by −φ about (u, v), so that x′ = (x − u) cos φ + (y − v) sin φ, y′ = −(x − u) sin φ + (y − v) cos φ, and θ′ = θ − φ. Finally, the stochastic completion field, C(u, v, φ), which represents the relative likelihood that a particle leaving a source state will pass through (u, v, φ) and enter a sink state before it decays, equals the product of the source and sink fields:

$$C(u, v, \phi) = p'(u, v, \phi) \cdot q'(u, v, \phi).$$

Clearly, the run-time complexity (neglecting the time required to compute G) of the above process is dominated by the convolution with the Green's function G′. The overall run-time complexity is therefore O(n⁴m²) for an n × n image with m discrete orientations.

3 Local Parallel Computation

In this section we show how to compute the completion field in a local parallel network. Fundamentally, the computation is the same as before:

1. Compute the source field.
2. Compute the sink field.
3. Compute their product.

The difference is in the way the source and sink fields are computed. In the new algorithm, the source field is computed by integrating the Fokker-Planck equation for the stochastic process identified by Mumford (1994). The probability density function representing a particle's position at time t′ is given by

$$p(x, y, \theta; t') = p(x, y, \theta; 0) + \int_{0}^{t'} dt\, \frac{\partial p(x, y, \theta; t)}{\partial t}$$
and

$$\frac{\partial P}{\partial t} = -\cos\theta\, \frac{\partial P}{\partial x} - \sin\theta\, \frac{\partial P}{\partial y} + \frac{\sigma^2}{2}\, \frac{\partial^2 P}{\partial \theta^2} - \frac{1}{\tau}\, P,$$

where P is p(x, y, θ; t). The Fokker-Planck equation for the stochastic process is best understood by considering each of the four terms separately. The first two terms taken together are the so-called advection equation:

$$\frac{\partial P}{\partial t} = -\cos\theta\, \frac{\partial P}{\partial x} - \sin\theta\, \frac{\partial P}{\partial y}.$$

Solutions of this equation have the form p(x, y, θ; t) = p(x + cos θ Δt, y + sin θ Δt, θ; t + Δt) for small Δt. That is, the probability density function for constant θ at times t and t + Δt is related through a translation by the vector [cos θ Δt, sin θ Δt]. Intuitively, for constant θ, probability density is being transported in the θ direction with unit speed. The third term is the classical diffusion equation:

$$\frac{\partial P}{\partial t} = \frac{\sigma^2}{2}\, \frac{\partial^2 P}{\partial \theta^2}.$$

Solutions of the diffusion equation are given by gaussian convolutions. Although, strictly speaking, the gaussian is not defined for angular quantities, for small Δt and variance σ² the following will approximately hold:

$$p(x, y, \theta; t + \Delta t) = \int_{-\pi}^{\pi} d\phi\, p(x, y, \theta - \phi; t)\, \frac{1}{\sigma\sqrt{2\pi\Delta t}}\, \exp(-\phi^2 / 2\sigma^2 \Delta t).$$
The last term implements the exponential decay:

$$\frac{\partial P}{\partial t} = -\frac{1}{\tau}\, P.$$

It is possible to interpret the Fokker-Planck equation in this case as describing a set of independent advection equations (one for each possible value of θ) coupled in the θ dimension by the diffusion equation. Taken together, the terms comprising the Fokker-Planck equation faithfully model the evolution in time of the probability density function representing a particle's position and orientation.

The value of the source field will also depend on the initial conditions (the probability density function [p.d.f.] describing the particle's position at time
zero). In our implementation, the p.d.f. defining the initial conditions consists of oriented impulses located at corners detected in the intensity image. These impulses represent the starting (or ending) points of possible completions. Strictly speaking, the Fokker-Planck equation can model our diffusion process only if we assume smooth initial conditions. Unfortunately, because the p.d.f.s defining the source and sink distributions are discontinuous, their partial derivatives are not defined. Moreover, the discrete method that we describe for solving the Fokker-Planck equation will be accurate only when a first-order approximation holds well over the length of scales that we discretely represent. For now, we will assume that the initial conditions are smooth and are well approximated by first-order expressions. We then show how to solve this partial differential equation iteratively. In the next section, we consider the accuracy of this method and describe a stochastic interpretation of the local method that does not require continuous initial conditions.

A standard technique for solving partial differential equations on a grid is to use the Taylor series expansion to write the partial derivatives at grid points as functions of the values of nearby grid points and of higher-order terms. Then by neglecting higher-order terms, we obtain an equation that expresses the value of a grid point at time t + 1 as a function of nearby grid points at time t, and which is accurate to first order (Ames, 1992; Ghez, 1988). This technique leads to the following iterative method for solving the Fokker-Planck equation:

$$\text{Step 1:}\quad p^{t+1/4}_{x,y,\theta} = p^{t}_{x,y,\theta} - \cos\theta \cdot \begin{cases} p^{t}_{x,y,\theta} - p^{t}_{x-1,y,\theta} & \text{if } \cos\theta > 0 \\ p^{t}_{x+1,y,\theta} - p^{t}_{x,y,\theta} & \text{if } \cos\theta < 0 \end{cases}$$

$$\text{Step 2:}\quad p^{t+1/2}_{x,y,\theta} = p^{t+1/4}_{x,y,\theta} - \sin\theta \cdot \begin{cases} p^{t+1/4}_{x,y,\theta} - p^{t+1/4}_{x,y-1,\theta} & \text{if } \sin\theta > 0 \\ p^{t+1/4}_{x,y+1,\theta} - p^{t+1/4}_{x,y,\theta} & \text{if } \sin\theta < 0 \end{cases}$$

$$\text{Step 3:}\quad p^{t+3/4}_{x,y,\theta} = \lambda\, p^{t+1/2}_{x,y,\theta-\Delta\theta} + (1 - 2\lambda)\, p^{t+1/2}_{x,y,\theta} + \lambda\, p^{t+1/2}_{x,y,\theta+\Delta\theta}$$

$$\text{Step 4:}\quad p^{t+1}_{x,y,\theta} = e^{-1/\tau} \cdot p^{t+3/4}_{x,y,\theta},$$
where λ = σ²/2(Δθ)². These equations require a few comments. First, we have split the computation into four separate steps. This standard fractional method allows us to compute p(x, y, θ; t + 1) with a series of separate convolutions in each of the three spatial directions. Second, notice that the first two steps of the computation depend on the sign of cos θ and sin θ. To understand why, consider a particle traveling in the positive x direction. In this case, we wish to ensure that our approximation of ∂p(x, y, θ; t)/∂x depends on p(x − 1, y, θ; t) but not on p(x + 1, y, θ; t) because p(x, y, θ; t + 1) depends on the first value but not the second. This method of ensuring that the
local difference used to approximate derivatives depends on the direction relevant to the underlying physical process is called upwind differencing and is necessary for stability. Finally, we can see that the recurrence relationship given by these equations can be computed through small-kernel convolution. For example, we can compute the first recurrence using a 1 × 3 kernel centered on the pixel being convolved. This kernel is [cos θ, 1 − cos θ, 0] for cos θ > 0 and [0, 1 + cos θ, −cos θ] for cos θ < 0.

The source field, p′(x, y, θ), which represents the probability that a particle beginning at a source will reach (x, y, θ) before it decays, is computed by integrating the probability density function representing the particle's position over time:

$$p'(x, y, \theta) = \int_{0}^{\infty} dt\, p(x, y, \theta; t).$$
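The four fractional steps translate directly into array operations. The following sketch (Python/NumPy; a schematic rendering under the assumptions Δx = Δy = Δt = 1 and, for brevity, periodic spatial boundaries, so it is not the authors' implementation) advances p by one time step and accumulates the source field by the recurrence given below.

```python
import numpy as np

def fokker_planck_step(p, sigma2, tau, m):
    """One iteration of Steps 1-4 on p[x, y, k], where orientation bin k
    represents theta_k = 2*pi*k/m. np.roll gives periodic boundaries."""
    dtheta = 2.0 * np.pi / m
    lam = sigma2 / (2.0 * dtheta ** 2)       # lambda = sigma^2 / 2(dtheta)^2
    out = np.empty_like(p)
    for k in range(m):
        th = k * dtheta
        c, s = np.cos(th), np.sin(th)
        q = p[:, :, k]
        # Step 1: upwind advection in x (difference taken from upwind side).
        q = q - c * ((q - np.roll(q, 1, axis=0)) if c > 0
                     else (np.roll(q, -1, axis=0) - q))
        # Step 2: upwind advection in y.
        q = q - s * ((q - np.roll(q, 1, axis=1)) if s > 0
                     else (np.roll(q, -1, axis=1) - q))
        out[:, :, k] = q
    # Step 3: diffusion in theta (wraps around, since theta is circular).
    out = (lam * np.roll(out, 1, axis=2) + (1.0 - 2.0 * lam) * out
           + lam * np.roll(out, -1, axis=2))
    # Step 4: exponential decay.
    return np.exp(-1.0 / tau) * out

def accumulate_source_field(p0, sigma2, tau, m, n_iters):
    """p'(x, y, theta): running sum of p over time, using the recurrence
    p'(t + 1) = p'(t) + p(t + 1)."""
    p, acc = p0.copy(), p0.copy()
    for _ in range(n_iters):
        p = fokker_planck_step(p, sigma2, tau, m)
        acc += p
    return acc
```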
We can approximate the integral of p(x, y, θ; t) over all times less than some fixed time t′ as follows:

$$p'(x, y, \theta; t') \approx \sum_{t=0}^{t'} p(x, y, \theta; t),$$
which can be computed by the following recurrence:

$$p'(x, y, \theta; t + 1) = p'(x, y, \theta; t) + p(x, y, \theta; t + 1).$$

Since O(n) iterations are required to bridge the largest gaps that might be found in an image of size n × n and each iteration requires O(n²m) time, the run-time complexity of the small-kernel method is O(n³m). In our experiments, the image size is typically 256 × 256 with 36 discrete orientations. This represents an effective speedup on the order of one thousand times when compared against the previous algorithm. We note that since the new algorithm is local and parallel, the communication overhead on a SIMD computer would be negligible, so that the parallel time complexity would be O(n) on a machine with O(n²m) processors. This is consistent with the estimated time required for illusory contours to form in human vision (Rubin et al., 1995).

4 Stability and Accuracy Considerations

We have presented a method of computing the stochastic completion field using repeated small-kernel convolutions. We now consider how faithfully this new method models the underlying stochastic process. First, we identify the conditions under which our finite-difference scheme converges to the underlying partial differential equation that it models. We then present a stochastic interpretation of the small-kernel method. This interpretation embodies our original (i.e., physical) intuition and applies even in the case
of discontinuous initial conditions. Finally, we compare the accuracy of our new method to our previous approach, which used a single large convolution mask.

We can use standard techniques to determine that our method will be stable, provided that the following three conditions are met (a quick numeric check appears after the list):

1. λ = (σ²Δt)/2(Δθ)² ≤ 1/2.
2. cos θ · Δt/Δx ≤ 1.
3. sin θ · Δt/Δy ≤ 1.
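A hypothetical helper for checking these conditions numerically (note that with Δt = 1, condition 1 reduces to σ ≤ Δθ):

```python
import numpy as np

def is_stable(sigma2, dtheta, dt=1.0, dx=1.0, dy=1.0):
    """Check the three stability conditions of the finite-difference scheme."""
    lam = sigma2 * dt / (2.0 * dtheta ** 2)   # condition 1: lambda <= 1/2
    # |cos(theta)| and |sin(theta)| are at most 1, so conditions 2 and 3
    # hold for every theta exactly when dt <= dx and dt <= dy.
    return (lam <= 0.5) and (dt <= dx) and (dt <= dy)

# Values from the text: sigma^2 = 0.005 and dtheta = pi/36 -> stable.
print(is_stable(0.005, np.pi / 36))
```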
Informally, stability is necessary to ensure that the solution of the partial differential equation at each time step incorporates all the possibly relevant information from the initial conditions and to ensure convergence to the correct solution as space and time are discretized more finely. In the experiments that we report in this article, we assume that Δx = Δy = Δt = 1. Therefore, the last two conditions will always be met. However, should we choose to discretize space more finely (to achieve greater accuracy), then we must also discretize time proportionately. Regarding the first condition, in all of our experiments we set Δθ = π/36. With this level of discretization, our method will be stable only when σ ≤ π/36. All of our experiments use values of σ much less than this limit.

We now present a stochastic interpretation of repeated small-kernel convolution. This discussion serves two purposes. First, the stochastic interpretation more directly embodies our original (physical) intuition that the prior probability distribution of completion shape can be modeled as a random walk in a lattice of discrete positions and orientations. Second, our stochastic interpretation applies even in the case of discontinuous initial conditions, where interpreting the small-kernel method as a first-order approximation to a partial differential equation breaks down.

The small-kernel method can be viewed as computing the probability distribution of the position of a particle undergoing a random walk in a lattice in R² × S¹. Suppose there is a particle in this lattice, with position (x, y, θ), which at the first time step:

• Moves to the neighboring site in the x direction with probability max(0, cos θ).
• Moves to the neighboring site in the −x direction with probability max(0, −cos θ).
• Remains at the current site with probability 1 − |cos θ|.

See Figure 3a. Clearly the convolution defined by the Step 1 equation updates the probability distribution on the lattice consistent with this motion. In a similar fashion, the change in the probability distribution of the particle's y-position is updated in the second time step. This update is described
Figure 3: The small-kernel method can be interpreted as a random walk in a lattice of discrete positions and orientations. (a) Step 1: Advection in x direction (for cos θ ≥ 0). (b) Step 2: Advection in y direction (for sin θ ≥ 0). (c) Step 3: Diffusion in θ . (d) Step 4: Exponential decay.
by the Step 2 equation (see Figure 3b). In the third time step, the particle:

• Moves to the neighboring site in the θ direction with probability λ.
• Moves to the neighboring site in the −θ direction with probability λ.
• Remains at the current site with probability 1 − 2λ.

See Figure 3c. The Step 3 equation updates the probability distribution on the lattice consistent with this motion. Finally, the Step 4 equation implements the exponential decay of the particle (see Figure 3d). After t iterations of these four steps, we have computed the distribution of the position of a set of particles at time t, given initial conditions that specify a distribution of their positions at time zero. This stochastic process is an exact model of
what our small-kernel method computes and is valid even when the initial conditions are discontinuous.

We now estimate the difference in magnitude between the outputs of the small- and large-kernel methods. Previously we constructed a large-kernel filter by simulating the paths of a large number of particles. Each particle starts its stochastic motion at the origin, with a direction of θ = 0. Time was divided into discrete units, and at each time step, Δθ was drawn from a continuous, gaussian distribution. The x and y positions of the particle at each time step were represented as real numbers. The magnitude of the large-kernel filter at any given point equaled the fraction of trajectories intersecting a small volume of x−y−θ space. It is straightforward to show that the distribution of a particle's x position after t time steps is $\sum_{i=1}^{t} c_i$ (where $c_i = \cos\theta_i$ and $s_i = \sin\theta_i$).

We now contrast this with the small-kernel method. If we let $x_i$ be a random variable that is 1 with probability $c_i$ and 0 with probability $1 - c_i$, then repeated small-kernel convolution produces values in the x direction that are a probability distribution for the random variable $\sum_{i=1}^{t} x_i$. Since the expected value of $x_i$ is $c_i$, it follows that the distribution of $\sum_{i=1}^{t} x_i$ has mean $\sum_{i=1}^{t} c_i$ and variance $\sum_{i=1}^{t} (x_i - c_i)^2 = \sum_{i=1}^{t} (c_i - c_i^2)$. This variance can range from 0 (when all $c_i = 0$ or all $c_i = 1$) to as much as t/4 when all $c_i$ equal 1/2 (when θ = π/3). Moreover, it follows from Lindeberg's theorem that if this distribution is normalized to be of mean 0 and unit variance, it will converge to a normal distribution as t → ∞. This depends on the variables satisfying the Lindeberg condition, which states roughly that each term in the sum becomes, in the limit, a vanishingly small component of the overall sum (see Feller, 1971). Similar reasoning tells us that the distribution of a particle's position in the y direction will have mean $\sum_{i=1}^{t} s_i$ and variance $\sum_{i=1}^{t} (y_i - s_i)^2 = \sum_{i=1}^{t} (s_i - s_i^2)$, which can also range from 0 to t/4. So, for example, with t = 100 the standard deviation of this "spread" in the position of a particle is bounded by five pixels.

This effect is not isotropic. The above expressions show that particles traveling in directions parallel to the x and y axes will have zero variance, while those traveling in the π/4 direction will have maximum variance.⁶ Also, the source field can be computed with greater accuracy (if needed) at the cost of additional computation. Suppose that we divide each unit in space, angle, and time into d subunits and perform the same computations on these subunits. Similar calculations show that the variance in each direction will be bounded by t/4d pixels (i.e., the standard deviation drops with

6. The lack of any detectable global phase coherence in the orientation preference structure of visual cortex (Niebur & Wörgötter, 1994) suggests that (unlike the obvious computer implementation on a rectangular grid) systematic long-range errors due to local anisotropy will be minimal in vivo.
We must, however, perform td convolutions on a field of size nd × nd × m, rather than t convolutions on a field of size n × n × m, increasing the run time by a factor of d³.

5 Anisotropic Decay

The fact that the previous computation took the form of a convolution with a large kernel was a simple consequence of the translational and rotational invariance of the random process. That is, the likelihood of a particle's taking a random walk of a given shape is assumed to be independent of the random walk's starting point and orientation. Because the new method is based on repeated small-kernel convolution, we are free to consider stochastic processes that are not the same everywhere, that is, processes that are neither homogeneous nor isotropic. There is some precedent for this in computer vision, where different methods lumped together under the term “anisotropic diffusion” are becoming increasingly popular for image enhancement (Perona & Malik, 1990; Nitzberg & Shiota, 1992; Weickert, 1995). The analogy breaks down in the sense that the quantity being “diffused” is in the one case image brightness, and in our case the probability density of a particle's position and orientation. Conceivably each of the two parameters defining the stochastic process (σ² and τ) could be modulated by local directional functions of the image brightness. Interpreted within a Bayesian framework, this could be viewed as augmenting the prior model with the additional information provided by the local image brightnesses to estimate better the parameters of the stochastic process modeling completion shape within the local neighborhood. In this article, we take only a first step in this direction by modulating the half-life of the particle by a directional function of the local image brightness. Using this anisotropic decay mechanism, we can ensure that illusory contours do not cross large brightness discontinuities.

The first requirement is that in regions where brightness is changing slowly, a particle's half-life should remain constant. This will allow illusory contours to form in these regions. The second requirement is that along edges, the half-life should be very short in directions leading away from the edge but very long in directions leading toward the edge. This will have the effect of channeling particles toward the edges. Finally, we require that the half-life should be very long in the direction tangent to the edge. This will encourage particles to continue traveling along the edge. All of these requirements can be satisfied by a function based on the directional second derivative,

    τ(θ) = τ · exp(D²_θ I)     if D²_{θ+π/2} I ≠ 0 and D¹_θ I > 0;
           τ · exp(−D²_θ I)    if D²_{θ+π/2} I ≠ 0 and D¹_θ I < 0;
           τ + |∇I|            otherwise,

where D¹_θ I and D²_θ I are the first and second directional derivatives of image brightness in the direction of the particle's motion and D²_{θ+π/2} I is the second directional derivative in the direction normal to the particle's motion. The condition D²_{θ+π/2} I ≠ 0 is simply used to decide whether the particle is traveling along an edge (a zero crossing of the second directional derivative in the direction normal to the particle's motion). If so, the half-life is taken to be τ + |∇I|, which becomes τ in homogeneous areas as required, irrespective of whether there are zero crossings. If the particle is not traveling in the direction of an edge, then the half-life is τ · exp(D²_θ I) or τ · exp(−D²_θ I), depending on the sign of D¹_θ I. This expression is large or small depending on whether the particle is moving toward or away from an edge, and it evaluates to τ in homogeneous areas.
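A minimal sketch of this decay function follows, assuming unit-step finite differences and nearest-pixel sampling; the helper names, the step sizes, and the zero-crossing tolerance eps are our assumptions, not the authors' implementation.

```python
import numpy as np

def half_life(I, x, y, theta, tau, eps=1e-6):
    """Sketch of the anisotropic decay function tau(theta) defined above.

    D1 and D2 estimate the first and second directional derivatives of the
    brightness image I along theta; D2n is the second directional
    derivative normal to theta.
    """
    def sample(dx, dy):
        return float(I[int(round(y + dy)), int(round(x + dx))])
    cx, sy = np.cos(theta), np.sin(theta)
    D1 = (sample(cx, sy) - sample(-cx, -sy)) / 2.0
    D2 = sample(cx, sy) - 2.0 * sample(0, 0) + sample(-cx, -sy)
    D2n = sample(-sy, cx) - 2.0 * sample(0, 0) + sample(sy, -cx)
    if abs(D2n) > eps and D1 > 0:
        return tau * np.exp(D2)
    if abs(D2n) > eps and D1 < 0:
        return tau * np.exp(-D2)
    # Zero crossing of D2n (traveling along an edge), or homogeneous region:
    grad_x = (sample(1, 0) - sample(-1, 0)) / 2.0
    grad_y = (sample(0, 1) - sample(0, -1)) / 2.0
    return tau + np.hypot(grad_x, grad_y)     # tau + |grad I|
```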
6 Experimental Results

Through a sequence of pictures shown in Figure 4, we first demonstrate the propagation of two probability density “waves” emitted from a pair of sources located on a line parallel to the x-axis and with orientations equal to π/6 and 5π/6. The pictures show the probability density p(x, y, θ; t) summed over all orientations (the marginal distribution in θ). The time interval between successive pictures in the sequence, Δt, equals 5, the variance, σ², equals 0.005, and the half-life, τ, equals 100. Error due to the inherent anisotropy of the finite-difference scheme is especially pronounced in the first five frames (frames 1–5), where it manifests itself as a flattening of the face of the wave along the vertical direction. This effect diminishes as the size of the wave increases, the result of greater effective resolution. In frames 11–20 the waves from the different sources can be seen to pass through each other. Because they are traveling in different “orientation planes,” there is no “collision.” Finally, we observe that there is a visible decrease in brightness between the first and last frames of the sequence. Although some of this decrease in brightness can be attributed to diffusion in space, the larger part can be attributed to the exponential decay of the particles.

A second sequence of images, shown in Figure 5, depicts the formation of the completion field as a function of time. To simplify this demonstration, p.d.f.s representing the source and sink distributions at time zero were assumed to be equal. Frames 1–10 are totally dark, since insufficient time has elapsed for particles traveling from the two sources to reach each other. The completion field first appears in frame 11, at the point midway between both sources. In frames 12–20 the completion field rapidly spreads outward in both directions away from the midpoint and back toward the two sources.⁷

⁷ Assuming that the neural locus of the stochastic completion field is the population of neurons in V2 identified by von der Heydt, Peterhans, and Baumgartner (1984) (see “Stochastic Completion Fields” elsewhere in this journal), then this prediction could be verified by measuring the latency between stimulus presentation and the onset of the neuron's increased firing rate. Because the wave has to arrive from both flanking elements, the function describing this latency should have the general form max(d₁, d₂) (where d₁ and d₂ are distances from the center of the neuron's receptive field to each flanking element).
Figure 4: Sequence of images depicting probability density “waves” emitted by a pair of sources located on a line parallel to the x-axis and with orientations equal to π/6 and 5π/6. The pictures show the probability density p(x, y, θ; t) summed over all orientations (the marginal distribution in θ). The time interval between successive pictures in the sequence, Δt, equals 5; the variance, σ², equals 0.005; and the half-life, τ, equals 100.
Figure 5: Sequence of images depicting the formation of the completion field (C(u, v, φ ; t)) as a function of time. Frames 1–10 are totally dark, since insufficient time has elapsed for particles traveling from the two sources to reach each other. The completion field first appears in frame 11, at the point midway between both sources. In frames 12–20 the completion field rapidly spreads outward in both directions away from the midpoint and back toward the two sources.
Figure 6: (a) Kanizsa triangle. (b) Stochastic completion field computed by large-kernel convolution with filter generated by Monte Carlo simulation (see “Stochastic Completion Fields” elsewhere in this journal) (summed over all orientations and superimposed on the brightness gradient magnitude image). (c) Stochastic completion field computed by large-kernel convolution with filter generated by analytic expression (Thornber & Williams, 1996). (d) Stochastic completion field computed by repeated small-kernel convolution.
The third demonstration is the well-known Kanizsa triangle (Kanizsa, 1979) (see Figure 6a). The sources and sinks were identified using the method
Figure 7: Numerical artifacts caused by anisotropy inherent in finite difference solution of the advection equation. (a) Completion field in neighborhood of “pacman” computed by large-kernel convolution. (b) Completion field in same neighborhood computed by repeated small-kernel convolution.
described in Williams and Jacobs (1997). Figure 6b shows the stochastic completion field computed using large-kernel convolution with a filter generated by Monte Carlo simulation (see our other article in this issue). The completion field is summed over all orientations and superimposed on the brightness gradient magnitude image for illustrative purposes. Figure 6c shows the same, but this result was computed with a filter generated by numerical integration of the analytic expression for G derived by Thornber and Williams (1996). This is by far the most accurate method of computing the stochastic completion field. Figure 6d shows the stochastic completion field computed using repeated small-kernel convolution. There are small but noticeable numerical artifacts due to the inherent anisotropy of the finite difference scheme for solving the advection equation (see Figure 7). The similarity between Figures 6b–d demonstrates that our characterization of the stochastic completion field is distinct from the algorithm (and neural network) that computes it. Although space considerations preclude our showing them here, we have rerun all of the experiments from our other article in this issue using the small-kernel algorithm and achieved results of similar quality. Finally, we have tested the small-kernel method on a picture of a dinosaur. The stochastic completion field, summed over all orientations and superimposed on an edge image for illustration purposes, is shown in Figure 8. The body of the dinosaur has been completed where it is occluded by the legs. In Figure 9 we demonstrate the effectiveness of the anisotropic decay
Figure 8: (a) Dinosaur image (Apatosaurus). (b) The body of the dinosaur has been completed where it is occluded by the legs.
function described in the previous section. This figure also illustrates the strong connection between the stochastic completion field and the active energy-minimizing contours (“snakes”) of Kass, Witkin, and Terzopoulos (1987). The stochastic completion field can be thought of as a probability distribution representing a family of snakes with given end conditions. This connection is strengthened by the inclusion of the anisotropic decay mechanism, since this device plays a role similar to the “external” energy term of the snake formulation. Figure 9a shows two probability density “waves” emitted from a pair of sources located on the dinosaur's back. The source and sink fields were assumed equal, and the time t = 32. Figure 9b shows the stochastic completion field computed for this pair of sources. The mode of the stochastic completion field (the curve of least energy) lies significantly below the back of the dinosaur. Furthermore, the variance of the distribution is quite large. Figure 9c shows the probability density “waves” emitted from this same pair of sources but with particle half-life determined by the anisotropic decay function. The probability density is maximum where the wave intersects the back of the dinosaur. Figure 9d shows that the mode of the resulting completion field conforms closely to the back of the dinosaur and that the variance of the distribution is significantly reduced.

In our last experiment, we demonstrate the effect of the anisotropic decay function using three variations of the Kanizsa square figure (Ramachandran et al., 1994). Although we compare the results of our simulation against the human percepts in each case, we make no special claims for the specific anisotropic decay mechanism described in this article. Our intention was not to propose a faithful model of human visual processing but to illustrate
Figure 9: The stochastic completion field can be thought of as a probability distribution representing a family of snakes with given end conditions. (a) Probability density at t = 32 (p(x, y, θ ; 32)) for a pair of sources placed on the back of the dinosaur. This was computed by the small-kernel method without anisotropic decay. (b) The mode of the stochastic completion field (the curve of least energy) lies significantly below the back of the dinosaur. (c) Probability density at t = 32 for same sources computed with anisotropic decay. The probability density is maximum where the wave intersects the back of the dinosaur. (d) The mode of the resulting completion field conforms closely to the back of the dinosaur.
that the local-parallel algorithm, unlike the large-kernel algorithm, allows illusory contour formation to be modulated by local image brightnesses. The first variation of the Kanizsa square, shown in Figure 10a, consists of four “pacmen” on a uniform white background. This figure is perceived as an illusory square occluding four black discs. The second, shown in Figure
Figure 10: (a) Kanizsa square. (b) Stochastic completion field magnitude along the line y = 64. (c) Kanizsa square with out-of-phase checkered background (see Ramachandran et al., 1994). (d) The enhancement of the nearest contrast edge is not noticeable unless the magnitudes are multiplied by a very large factor (approximately 1.0 × 10⁶). (e) Kanizsa square with in-phase checkered background. (f) The peak completion field magnitude is increased almost threefold, and the distribution has been significantly sharpened.
10c, is the same except that a checkered background has been added. The phase of the checkered pattern is such that illusory contours joining the “pacmen” must traverse large brightness gradients. Ramachandran et al. report not only that the illusory square is suppressed by the addition of the out-of-phase checkered background but also that the edges of the checker pattern nearest the old illusory square are enhanced. The third variation, shown in Figure 10e, also places the “pacmen” on a checkered background, but this time the phase of the checkered pattern is such that the illusory contours encounter no large brightness gradients but instead coincide with the borders of the squares. Ramachandran et al. report that the illusory square is enhanced by the addition of the in-phase checkered background.

In our simulations, the images were of size 128 × 128 with 12 discrete orientations. The positions and orientations of the sources and sinks (the corners of the “pacmen”) were defined manually. The diffusivity was σ² = 0.01, and the particle half-life was τ = 100. The magnitude of the stochastic completion field along the line y = 64 is displayed to the right of each of the Kanizsa squares. Figures 10b and 10f are scaled by the same factor, so the results may be directly compared. We note that the peak completion field magnitude is increased almost threefold by the addition of the in-phase checkered background and that the distribution has been significantly sharpened. The effect of adding the out-of-phase checkered background is that the stochastic completion field from Figure 10b is almost totally suppressed. To a first approximation, this is consistent with Ramachandran et al.'s observation. However, the enhancement of the nearest contrast edge, also reported in Ramachandran et al. (1994), is not noticeable unless the completion field magnitudes are multiplied by a very large factor (approximately 1.0 × 10⁶); see Figure 10d. To summarize, the results of our simulations are largely consistent with the observations reported in Ramachandran et al. (1994), but the enhancement effect observed in the case of the out-of-phase checkered background is far smaller.
7 Conclusion

We have shown how to model illusory contour formation using repeated, small-kernel convolutions. This method directly improves on several published models because of its greater efficiency (by several orders of magnitude) and because it allows edges of objects to be connected in a way that takes account of all relevant image intensities, forming illusory contours only when they are consistent with these intensities. Our work therefore helps to bridge the gap between work on understanding human performance on illusory contours and work in computer vision (e.g., snakes) on finding contours that are smooth yet faithful to the image intensities.
References

Ames, W. (1992). Numerical methods for partial differential equations. San Diego: Academic Press.
Feller, W. (1971). An introduction to probability theory and its applications (Vol. 2). New York: Wiley.
Ghez, R. (1988). A primer of diffusion problems. New York: Wiley.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of form perception: Boundary completion, illusory figures, and neon color spreading. Psychological Review, 92, 173–211.
Guy, G., & Medioni, G. (1996). Inferring global perceptual contours from local features. International Journal of Computer Vision, 20, 113–133.
Heitger, F., & von der Heydt, R. (1993). A computational model of neural contour processing, figure-ground and illusory contours. Proc. of 4th Intl. Conf. on Computer Vision, Berlin, Germany.
Kanizsa, G. (1979). Organization in vision. New York: Praeger.
Kass, M., Witkin, A., & Terzopoulos, D. (1987). Snakes: Active minimum energy seeking contours. Proc. of the First Intl. Conf. on Computer Vision, London, 259–268.
Marr, D. (1982). Vision. San Francisco: Freeman.
Mumford, D. (1994). Elastica and computer vision. In C. Bajaj (Ed.), Algebraic geometry and its applications. New York: Springer-Verlag.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Computation, 6, 602–614.
Nitzberg, M., & Shiota, T. (1992). Nonlinear image filtering with edge and corner enhancement. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14, 826–833.
Perona, P., & Malik, J. (1990). Scale space and edge detection using anisotropic diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12, 629–639.
Ramachandran, V. S., Ruskin, D., Cobb, S., Rogers-Ramachandran, D., & Tyler, C. W. (1994). On the perception of illusory contours. Vision Research, 34(23), 3145–3152.
Rock, I., & Anson, R. (1979). Illusory contours as the solution to a problem. Perception, 8, 665–681.
Rubin, N., Shapley, R., & Nakayama, K. (1995). Rapid propagation speed of signals triggering illusory contours. Investigative Ophthalmology and Visual Science (ARVO), 36(4).
Thornber, K. K., & Williams, L. R. (1996). Analytic solution of stochastic completion fields. Biological Cybernetics, 75, 141–151.
von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262.
Weickert, J. (1995). Multi-scale texture enhancement. Conf. on Computer Analysis of Image and Patterns, Prague, Czech Republic, 230–237.
Williams, L. R., & Jacobs, D. W. (1997). Stochastic completion fields: A neural model of illusory contour shape and salience. Neural Computation, 9, 837–858.

Received April 1, 1996; accepted September 25, 1996.
Communicated by Peter Dayan
Optimal, Unsupervised Learning in Invariant Object Recognition

Guy Wallis
Max-Planck-Institut für biologische Kybernetik, 72076 Tübingen, Germany
Roland Baddeley Department of Experimental Psychology, University of Oxford, Oxford OX1 3UD, UK
A means for establishing transformation-invariant representations of objects is proposed and analyzed, in which different views are associated on the basis of the temporal order of the presentation of these views, as well as their spatial similarity. Assuming knowledge of the distribution of presentation times, an optimal linear learning rule is derived. Simulations of a competitive network trained on a character recognition task are then used to highlight the success of this learning rule in relation to simple Hebbian learning and to show that the theory can give accurate quantitative predictions for the optimal parameters for such networks.

1 Introduction

How might we learn to recognize an object irrespective of the precise relationship between viewer and object, characterized by viewing angle, distance, and translation? If we believe the view-centered approach to object recognition (Tarr & Pinker, 1989; Bülthoff & Edelman, 1992), then the problem becomes one of associating together a series of different views of the object that may share few, if any, of the features supporting the recognition. A broadly tuned feature-based system would be sufficient to perform recognition over small transformations (Poggio & Edelman, 1990), and the necessary receptive fields might be learned via a simple competitive network using Hebbian learning. However, large shape transformations would require either separate prenormalization for size and translation or separate feature detectors feeding into a final arbitration layer. This then raises the question of how such a final layer might be trained, dismissing the use of any form of supervised training signal.

The use of prenormalization and a final arbiter is in fact in contrast to the evidence we have from the responses of real neurons implicated in object recognition. Invariance seems to be established over a series of processing stages, starting from neurons with restricted receptive fields and culminating in the types of cell responses found in inferior temporal (IT) cortex (Perrett & Oram, 1993; Rolls, 1992). Cells in this region exhibit invariance

Neural Computation 9, 883–894 (1997)
© 1997 Massachusetts Institute of Technology
to combinations of the types of transformations discussed here (Rolls, 1992; Desimone, 1991; Tanaka, Saito, Fukada, & Moriya, 1991). Miyashita (1988) has illustrated that arbitrary stimuli can be associated together by a single IT neuron in primates. The key to the training he used was the association of these images in time, not in space.

The hypothesis expressed here is that temporal relations in the appearance of object views affect learning. In the real world, we see objects for protracted if variable periods while they undergo any manner of natural transformations, such as when we approach an object or rotate it in our hand. The consistent stream of images serves as a cue that all of these images belong to the same object and that it would be expedient to associate them together. This idea has been described elsewhere (Edelman & Weinshall, 1991; Földiák, 1991; Wallis & Rolls, 1997); in this article, we aim to build a theoretical framework for analyzing the response of a neuron over time. From this framework we then intend to derive the form of an optimal, local training rule given the general probability function governing the length of each exposure to images of the same object.

2 Mapping the Output of a Neuron to the Optimal Training Signal

The most obvious signal present in a neuron for performing learning is the activity of the neuron itself. Unfortunately, this Hebbian training signal will not typically be optimal for learning invariant object recognition, due to erroneous classifications made by the neuron to spatially similar images from different objects and spatially dissimilar images derived from the same object. A potentially useful piece of a priori knowledge to overcome this problem is that objects tend to be viewed over variable but extended periods of time. In this work the consequent temporal structure will be used to provide information for improving the use of local neural activity as a training signal. This is achieved here by using a weighted sum of previous neuronal activity rather than by attending to the current neuronal output alone.

Assuming the neuronal output to be a signal described by the function y(t), and the ideal training signal to be s(t), discrepancies between the two signals can be regarded as due to some noise signal n(t). Couched in these terms, the optimal recovery of s(t) from y(t) can be regarded as a classical filtering problem. The form of the optimal linear filter for retrieving s(t) can be derived by Wiener filtering (Wiener, 1949; Press, 1992).¹ The Wiener filter, φ(t), gives an optimal estimate, s̃(t) (in the least-squares sense), of the true signal, s(t), when applied to the noise-contaminated output, y(t).
¹ There are presumably innumerable more optimal nonlinear filters, which are beyond the scope of the Wiener filtering theory given here.
Adopting the convention that capital letters denote functions transformed into the frequency domain, the theorem states:

    s̃(f) → S̃(f) = Y(f) Φ(f)    (2.1)

    Φ(f) = |S(f)|² / (|S(f)|² + |N(f)|²).    (2.2)
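In a discrete setting, equations 2.1 and 2.2 can be applied directly with the FFT. The sketch below, in which the function name and the toy signal are our choices, builds Φ(f) from the signal and noise spectra and applies it to the noisy output:

```python
import numpy as np

def wiener_estimate(y, s, n):
    """Sketch of eqs. 2.1-2.2: optimal linear estimate of s(t) from y(t).

    y: noisy neural output; s: ideal training signal; n: noise samples
    used to estimate |N(f)|^2.
    """
    S2 = np.abs(np.fft.fft(s)) ** 2
    N2 = np.abs(np.fft.fft(n)) ** 2
    Phi = S2 / (S2 + N2 + 1e-12)          # guard against division by zero
    return np.real(np.fft.ifft(np.fft.fft(y) * Phi))

# Toy usage: a periodic box signal (object present 10 of every 100 steps)
# corrupted by white noise of amplitude rho = 0.45.
rng = np.random.default_rng(1)
t = np.arange(1000)
s = ((t % 100) < 10).astype(float)
n = 0.45 * rng.standard_normal(t.size)
s_tilde = wiener_estimate(s + n, s, n)
print(np.mean((s_tilde - s) ** 2), "<", np.mean(n ** 2))  # filtering reduces error
```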
Hence, the optimal filter φ(t) can be determined directly from the estimated system noise n(t) and the true signal s(t). The following two sections consider the predicted optimal filter for three different forms of s(t), each describing a different regime for stimulus presentation.

2.1 Fixed-Length Presentation Times. As a simple first case, a set of stimuli are assumed to appear in identical, fixed-length sequences. Possible forms of the ideal training signal and neural output associated with this training regime appear in Figure 1. The form of s(t) is easily characterized, but there remains the question of how to represent the noise signal n(t). In these calculations it is assumed to be white, that is, uncorrelated across time, making N(f) a constant for all f, denoted ρ. The validity of this assumption will be tested in the experimental section that follows.

The power spectra of s(t) and n(t) are shown in Figure 2. The large low-frequency bias of the training signal s(t) relative to the noise signal n(t) strengthens the hypothesis that a low-pass filter would be effective in removing noise in the training signal. Knowing the form of S(f) and N(f) leads to the following definition of the optimal filter Φ(f):
    Φ(f, τ, T) = τ² / (τ² + ρ² T²)                             :  f = 0;
                 4 sin²(πτf/T) / (4 sin²(πτf/T) + π² f² ρ²)    :  f ∈ Z, f > 0;
                 0                                             :  otherwise,    (2.3)
where T represents the period of an entire training epoch, in which all stimuli are seen once, τ corresponds to the period over which all versions of a single stimulus are presented, and Δ is the time for a single stimulus presentation. Since the training signal is effectively sampled every Δ time units, all signal frequencies above the Nyquist frequency f_N = 1/(2Δ) are subject to aliasing. Noise power is hence constrained to lie between zero and f_N. Signal amplitude above f_N is assumed negligible, since in general a_Δ ≫ a_{f_N}.²
² For example, if T = 100Δ and τ = 10Δ, a_Δ ≈ 50 a_{f_N}.
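The reconstructed equation 2.3 can be evaluated numerically. The sketch below (taking Δ = 1, so τ = 10 and T = 100, with ρ = 0.45) synthesizes the filter weights by the cosine sum that equation 2.8 makes explicit later; the function names, the truncation to 50 frequencies, and the 15-tap window are our choices:

```python
import numpy as np

def Phi(f, tau=10, T=100, rho=0.45):
    """Equation 2.3 as reconstructed above, with Delta = 1 time units."""
    if f == 0:
        return tau**2 / (tau**2 + rho**2 * T**2)
    s2 = 4.0 * np.sin(np.pi * tau * f / T) ** 2
    return s2 / (s2 + np.pi**2 * f**2 * rho**2)

def phi(t, n_freqs=50, T=100):
    """Weight on activity t steps in the past, by cosine synthesis."""
    return sum(Phi(f) * np.cos(2.0 * np.pi * f * t / T) for f in range(n_freqs + 1))

weights = np.array([phi(t) for t in range(15)])
print(np.round(weights / weights[0], 3))   # decaying weighting, cf. Figure 3
```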
Figure 1: (a) The ideal training signal s(t), which is one when the object is present and zero when absent. (b) Example of a possible neural output signal y(t). (c) Attempt to retrieve s(t) from y(t) by taking a weighted sum of the outputs y(t) at previous times.
The first diagram in Figure 3 depicts the optimal filters derived from the expression for Φ(f) over a range of noise values,³ for the example case τ = 10Δ and T = 100Δ. Weightings at each time step are proportional to the weighting of the cell's activation at this time in the past, denoted φ̄(t). The level of noise clearly affects the form of the Wiener filter. In practice, as the neuron learns, its associated error rate will change, and with it, the form of the optimal linear filter. The large negative trough apparent in all of these filter profiles at time T/τ results from fixing the ratio τ/T. In the next section, this trough is seen to disappear when using variable-length presentation times.

³ ρ = 0.45 is equivalent to an error rate of 10 percent; ρ = 0.14 to an error rate of 1 percent.
Figure 2: S(f) and N(f) for a system where presentation times are fixed. f_N = 1/(2Δ) is the Nyquist frequency.
Figure 3: Normalized Wiener filters φ(t) for the three presentation paradigms (panels: Fixed, Slow Decay, and Fast Decay, with curves for noise levels from ρ = 0.1 to ρ = 0.5). The vertical axis represents the weighting accorded to each sample of neural activity; the horizontal axis represents the sample number in multiples of Δ time steps previous to the current sample at time zero.
2.2 Variable-Length Presentation Times. In the real world one might expect to see objects over variable periods of time rather than for fixed periods. Since presentation time is a scale parameter, it is reasonable to propose that the presentation times are in fact Jeffreys distributed (P(τ) ∝ 1/τ) or, equivalently, that log presentation times are uniformly distributed (see Jaynes, 1983), which implies that objects tend to be seen for short periods but are occasionally seen for much longer periods. In this section, new optimal filters are derived from two different presentation probability distributions. In the first, the maximum presentation time for a stimulus, τ̂, is set to be 100Δ, and the minimum presentation length to be Δ. In the second, τ̂ is reduced to 10Δ. The period T is now distributed about a mean dependent on τ̂; hence the mean value of T is given the symbol T_τ̂. Presentation-length probabilities are defined by a simple Jeffreys distribution extending within the range Δ ≤ τ ≤ τ̂, such that P(τ)|_{τ=τ̂} = 0. If k is the number of stimulus classes, then we have:

    P(τ) = Z_τ̂⁻¹ (τ⁻¹ − τ̂⁻¹) :  Δ ≤ τ ≤ τ̂    (2.4)

    Z_τ̂ = Σ_{s=Δ}^{τ̂} (s⁻¹ − τ̂⁻¹) :  Z_{100Δ} = 4.18738 Δ⁻¹,  Z_{10Δ} = 1.92897 Δ⁻¹    (2.5)

    T_τ̂ = k Σ_{τ=Δ}^{τ̂} τ P(τ) :  T_{100Δ} = 11.8212 Δk,  T_{10Δ} = 2.33285 Δk.    (2.6)

The average optimal filter ⟨φ(t)⟩ is then:

    ⟨φ(t)⟩ = Σ_{τ=Δ}^{τ̂} P(τ) φ(t)    (2.7)

           = Σ_{τ=Δ}^{τ̂} Σ_{f=0}^{∞} P(τ) Φ(f, τ, T_τ̂) cos(2πft / T_τ̂).    (2.8)
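As a check on the reconstructed constants, the following sketch (with Δ = 1, so that τ̂ = 100 or 10) evaluates equations 2.4–2.6 directly; the function name is ours:

```python
import numpy as np

def presentation_prior(tau_hat):
    """Eqs. 2.4-2.5 with Delta = 1: P(tau) proportional to 1/tau - 1/tau_hat."""
    taus = np.arange(1, tau_hat + 1)
    Z = np.sum(1.0 / taus - 1.0 / tau_hat)          # eq. 2.5
    P = (1.0 / taus - 1.0 / tau_hat) / Z            # eq. 2.4
    return taus, P, Z

for tau_hat in (100, 10):
    taus, P, Z = presentation_prior(tau_hat)
    T_over_k = np.sum(taus * P)                     # eq. 2.6, divided by k
    print(f"tau_hat = {tau_hat:3d}:  Z = {Z:.5f},  T/k = {T_over_k:.5f}")
# Prints Z = 4.18738, T/k = 11.82121 for tau_hat = 100 and
# Z = 1.92897, T/k = 2.33285 for tau_hat = 10, matching eqs. 2.5-2.6.
```

The averaged filter ⟨φ(t)⟩ of equations 2.7–2.8 then follows by weighting the earlier Phi(f, τ, T) sketch by P(τ) and summing the cosine series.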
The second two graphs of Figure 3 show the filters for these two random presentation distributions. Switching to variable-length presentation sequences sees the disappearance of the trough present in the fixed-length presentation case. In addition, the shorter average presentation-time paradigm yields temporal filters with shorter time constants.

2.3 The Trace Rule. This section describes a locally implementable learning rule that can realize the general form of all of the filters described in the previous section. The learning rule produces a running average of neural activity based on a weighted sum of the neuron's previous activity. The simple recursive form of the learning rule may be important in that it lends itself to implementation locally within a cortical neuron (Wallis & Rolls, 1997).
Figure 4: Graphs showing the change in the predicted optimal value of η with changing error rate (black lines and circles) for the three paradigms (panels: Fixed, T = 100Δ; Slow Decay, τ̂ = 100Δ; Fast Decay, τ̂ = 10Δ). Also shown are the least-mean-squares fit errors at each noise level, shown above the vertical bars. The thick arrows indicate the predicted optimal value of η over the range of noise values shown.
The learning rule has a long history but was used most recently in invariant object recognition for orthogonal images (Földiák, 1991) and nonorthogonal images (Wallis, 1996b), and can be summarized as follows:

    Δw_ij(t) = α ȳ_i(t) x_j :  Σ_j w_ij² = 1 for each ith neuron    (2.9)

    ȳ_i(t) = (1 − η) y_i(t) + η ȳ_i(t−1),    (2.10)
where x_j is the jth input to the neuron, y_i is the output of the ith neuron, w_ij is the jth weight on the ith neuron, η governs the relative influence of the trace and the new input, and ȳ_i(t) represents the value of the ith cell's trace at time t.

The left-most diagram of Figure 4 represents the main result for this section. The optimal value of the trace rule parameter η is plotted as the dark line and varies as a function of the noise ρ. The vertical bars show the relative quality of fit of the filter implemented by the trace rule to the optimal filter, as a least-mean-squares fit for the first 20 steps in the filter, with actual fitting errors appearing on the top of each bar. The optimal value of η gradually drops as the noise level drops but remains around 0.8 over a very wide range of noise, suggesting that a constant trace rule could be used throughout training. The remaining two graphs of Figure 4 show the change in the best-fitting value of η as a function of ρ for the variable presentation times. The graphs are similar to the fixed presentation time case, except that the optimal value of η is a little higher for the slow-decaying case (between 0.8 and 0.9) and somewhat lower for the fast-decaying case (0.7). The average fitting error is also considerably smaller, indicating that the training rule fits the optimal filter more closely under the two probabilistic presentation regimes.
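A minimal sketch of one update under equations 2.9–2.10 follows; the learning rate α, the explicit normalization step, and the omission of the output nonlinearity and competition are our simplifications:

```python
import numpy as np

def trace_rule_step(w, x, y_trace_prev, eta=0.8, alpha=0.05):
    """One trace-rule update for a single output neuron (eqs. 2.9-2.10).

    w: weight vector; x: current input vector; y_trace_prev: the cell's
    trace from the previous time step.
    """
    y = float(w @ x)                                # instantaneous output y_i(t)
    y_trace = (1.0 - eta) * y + eta * y_trace_prev  # eq. 2.10: running trace
    w = w + alpha * y_trace * x                     # eq. 2.9: Hebbian on the trace
    return w / np.linalg.norm(w), y_trace           # enforce sum_j w_ij^2 = 1
```

Setting η = 0 recovers plain Hebbian learning on the instantaneous output, the baseline against which the trace rule is compared in the simulations of Section 3.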
Figure 5: Architecture of the network used in the simulations and the two sets of digits used during training and testing.
The accuracy of these and previous predicted optimal values of η is tested in the next section.

3 Simulating the Predictions of the Wiener Filtering

A series of predicted optimal linear filters, parameterized for stimulus presentation length and the classification error rate of the neurons ρ, have now been produced. To gauge the accuracy of the predictions, a series of simulations were run.

3.1 Methods. A two-layer network was constructed (see Figure 5). The first layer acts as a local feature extraction layer and consists of a 16 × 16 grid of neurons arranged in 4 × 4 pools. Each pool fully samples a corresponding 4 × 4 patch of a 16 × 16 input image. All learning in this layer is standard Hebbian. Above the input layer is a second layer, consisting of a single inhibitory pool of 10 neurons, which fully samples the first layer. Neurons in this layer are trained with the trace rule. Competition acts within each pool in both layers and is implemented using the softmax algorithm (Bridle, 1990), sketched below. Digits from a character set used by LeCun et al. (1989) were presented to the network. During learning, 100 digits—10 of each type—were presented in permuted random sequence according to the three stimulus presentation distributions described in the previous section.
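The within-pool competition might be sketched as follows; the softmax form follows Bridle (1990), but the temperature parameter and function name are our additions for illustration:

```python
import numpy as np

def soft_max_competition(activations, temperature=1.0):
    """Soft competition within a pool: members receive output proportional
    to exp(activation), competing for a fixed total of activity rather
    than winner-take-all."""
    a = np.asarray(activations, dtype=float) / temperature
    e = np.exp(a - a.max())          # subtract the max for numerical stability
    return e / e.sum()

print(np.round(soft_max_competition([1.0, 2.0, 3.0]), 2))   # [0.09 0.24 0.67]
```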
Learning in Invariant Object Recognition
Fast Decay, τ^ = 10 ∆
Slow Decay, τ^ = 100 ∆
Fixed, T=100 ∆
100
80 60 40 Training Set
20
80 60 40 Training Set
20
Cross Validation Set
0
0.2
0.4
η
0.6
0.8
Performance %
100
Performance %
100
Performance %
891
80 60 40 Training Set
20
Cross Validation Set
1
0
0.2
0.4
η
0.6
0.8
Cross Validation Set
1
0
0.2
0.4
η
0.6
0.8
1
Figure 6: Classification performance for the network at differing values of η for the three presentation-length paradigms (panels: Fixed, T = 100Δ; Slow Decay, τ̂ = 100Δ; Fast Decay, τ̂ = 10Δ), measured on both the training set and the cross-validation set. The thick arrows indicate the optimal values of η predicted by the earlier theory, which are seen to be in accord with the optimal values to come out of the simulations.
Such a stream of digits might be generated in the real world while observing a digit printed in a book, by altering the reading distance or tilting a page. Although we have chosen to concentrate on learning deformation invariance of digits here, this system has been shown to be able to cope with other transformations of more natural stimuli, such as faces, undergoing depth rotations, scale changes, and translations (Wallis & Rolls, 1997; Wallis, 1996a). Network performance was measured as recognition of both the 100 digits from the training set and a 100-digit cross-validation set from the same database. In addition to these two performance measures, the amount of noise present in a neuron's output signal when trained from start to steady response was also recorded.

3.2 Results. The results shown in Figure 6 reveal the effect on overall classification performance of changing the value of the trace parameter η over 10 runs of the network. Under each of the three presentation paradigms, the optimal value of η predicted in the previous section is seen to correspond very closely to the peak in the performance graph displayed alongside (see the dark arrows in each of the six graphs). Therefore, despite two possible theoretical problems (the assumptions of independent errors and of linearity), the predictions from the theory are very close to the results of the simulations. For η = 0 the results correspond to simple Hebbian learning. These results are clearly much worse than those achieved using the optimized trace rule. Further comparisons with other architectures and other learning rules are described elsewhere (Wallis, 1996b).

The foregoing calculations depend on the errors' being approximately random. Figure 7 shows the average power spectrum recorded for a neuron
Figure 7: Average log power spectrum for noise n(t) and ideal training signal s(t) in the response of neurons under each of the three presentation-length training paradigms.
Figure 8: Optimal filters calculated from the measured neural noise under each of the three presentation-length training paradigms. The noise signal used to calculate the filters was assumed to be either “white,” that is, uniform across all frequencies (dashed line), or the true low-frequency biased signal (solid line).
monitored from a random starting point to the point at which a steady-state response had been achieved. The spectrum is averaged across all 10 output cells and also across time, with the signals being binned into 100 consecutive stimulus presentations. All three graphs show a peak in the low-frequency end of the noise spectra, although the power of this signal is considerably less than the signal power, and the spectrum is essentially flat for all higher frequencies up to and beyond the point at which noise power exceeds signal power, in accordance with the form of the graph in Figure 2. Knowing the true noise signal permits recalculation of the optimal filters. Figure 8 shows the optimal filters obtained using the true low-frequency biased noise spectrum versus using a flat (white) spectrum, taken as the average of the true noise power across all frequencies. The optimal filter for the true noise case is very similar to the white noise case, although the decay through time is consistently slightly faster. The
fact that this difference is consistent may well be attributed to the fact that one effect of the peak at low frequencies in the true noise signal is to shift the frequency at which the noise and signal spectra cross, in comparison with the crossing point for the flat, white noise case. The white noise curve would cross the signal curve at the same frequency as in the true noise case if its power were reduced slightly, and so the change from white to true noise can be thought of as a reduction of the effective white noise amplitude. Figure 4 showed how lower noise values correspond to smaller time constants, consistent with the shift seen in Figure 8. Despite this shift, the stability of the optimal value of η with respect to changes in noise (see Figure 4) ensures that the predicted optimal value of η never differs by more than 0.05 between the white and true noise cases, supporting the use of a flat noise spectrum in the theory section of this article.

4 Conclusions

This article has proposed that objects encountered in the world are presented in sequences of views of the same object, not randomly. This temporal regularity can then be used to help form invariant representations by training on the basis of a signal generated from a weighted sum of previous neural activity, the form of which can be determined by optimal filtering theory. Further, if we make the reasonable assumption that the log of presentation times is uniformly distributed, then this weighting can be achieved by a simple local learning rule. The validity of this theory has been tested on the recognition of digits, demonstrating that optimal parameters were well predicted and that assumptions about the form of the noise and system linearity were justified.

In cases where we have explicit labels for inputs, supervised learning techniques may well be preferable, but in the natural world, the visual input to the brain often does not have such labels. In this case, any regularity in the input (spatial or temporal) should be exploited by an unsupervised system to learn useful representations. Much work has concentrated on exploiting spatial correlations. Here we show that temporal regularities can also be exploited near optimally using the simple and local trace rule.

Acknowledgments

We are grateful to two anonymous reviewers and this journal's editor in chief for helping to improve the clarity and depth of this article considerably.

References

Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters.
In D. S. Touretzky (Ed.), Neural information processing (Vol. 2, pp. 211–217). San Mateo, CA: Morgan Kaufmann.
Bülthoff, H., & Edelman, S. (1992). Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences, USA, 89, 60–64.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.
Edelman, S., & Weinshall, D. (1991). A self-organising multiple-view representation of 3D objects. Biological Cybernetics, 64, 209–219.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Jaynes, E. T. (1983). Papers on probability, statistics and statistical physics. Dordrecht: Reidel.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.
Perrett, D., & Oram, M. W. (1993). Neurophysiology of shape processing. Image and Vision Computing, 11(6), 317–333.
Poggio, T., & Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263–266.
Press, W. H. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical areas. Philosophical Transactions of the Royal Society, London (B), 335, 11–21.
Tanaka, K., Saito, H., Fukada, Y., & Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66, 170–189.
Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21, 233–282.
Wallis, G. (1996a). Temporal order in object recognition learning. To appear in Journal of Biological Systems.
Wallis, G. (1996b). Using spatio-temporal correlations to learn invariant object recognition. Neural Networks, 9, 1513–1519.
Wallis, G., & Rolls, E. T. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Wiener, N. (1949). Extrapolation, interpolation, and smoothing of stationary time series: With engineering applications. New York: Wiley.
Received February 8, 1996; accepted September 4, 1996.
Communicated by Zhaoping Li
Activation Functions, Computational Goals, and Learning Rules for Local Processors with Contextual Guidance

Jim Kay
Department of Statistics, University of Glasgow, Glasgow G12 8QQ, Scotland, UK
W. A. Phillips Centre for Cognitive and Computational Neuroscience, University of Stirling, Stirling FK9 4LA, Scotland, UK
Information about context can enable local processors to discover latent variables that are relevant to the context within which they occur, and it can also guide short-term processing. For example, Becker and Hinton (1992) have shown how context can guide learning, and Hummel and Biederman (1992) have shown how it can guide processing in a large neural net for object recognition. This article studies the basic capabilities of a local processor with two distinct classes of inputs: receptive field inputs that provide the primary drive and contextual inputs that modulate their effects. The contextual predictions are used to guide processing without confusing them with receptive field inputs. The processor's transfer function must therefore distinguish these two roles. Given these two classes of input, the information in the output can be decomposed into four disjoint components to provide a space of possible goals in which the unsupervised learning of Linsker (1988) and the internally supervised learning of Becker and Hinton (1992) are special cases. Learning rules are derived from an information-theoretic objective function, and simulations show that a local processor trained with these rules and using an appropriate activation function has the elementary properties required.

1 Introduction

Many studies have shown that useful computational capabilities can arise from local processors whose outputs are a function of just the weighted sum of their inputs. It has also been shown that these capabilities can be enhanced by providing the local processors with specific contextual information that affects processing in a way quite different from that of the primary driving inputs. For example, Becker and Hinton (1992) have shown how specific information about context can be used to guide learning. They assumed multiple processing channels that operate on different input data sets across which there is statistical dependence. The aim of learning was to discover functions of the input data that make this statistical dependence

Neural Computation 9, 895–910 (1997)
© 1997 Massachusetts Institute of Technology
easier to compute. To do this, the mutual information in the outputs of the separate processors was maximized. The channels therefore communicated information to each other about their current outputs, but this information was used only for learning and had no effect on the short-term processing. However, contextual information can also be used to guide processing. For example, Hummel and Biederman (1992) have shown how it could be used to guide processing in a large neural net for object recognition. In their case, the contextual connections, called fast enabling links, were used to determine the phase with which an oscillatory output would be produced but without having any effect on the amplitude. This is important because it was the amplitude of the signal that conveyed information about the extent to which the receptive field inputs met the featural requirements of the local processor. Many detailed psychological findings support the approach of Hummel and Biederman (1992). Analogous proposals have emerged from the discovery of context-sensitive synchronization between the spike trains of cortical cells (Singer, 1990; Engel, König, Kreiter, Schillen, & Singer, 1992; Eckhorn, Reitboeck, Arndt, & Dicke, 1990) and of a basis for it in horizontal intrinsic connections (Löwel & Singer, 1992). Singer (1990) and Engel et al. (1992) call the contextual inputs coupling connections, and Eckhorn et al. (1990) call them linking connections. The aspect of these connections that is crucial here is that they help specify whether a signal is relevant in some immediate context, but they do not change what it means.

From a statistical point of view, the motivation for studying processors with contextual guidance is that they might implement some form of latent structure analysis that discovers the predictive relationships implicit in separate data sets. Some nets can be described as discovering the principal components of variation in the receptive field inputs. Becker and Hinton (1992) describe their algorithm as providing a nonlinear version of canonical correlation. Fisher (1936) developed a method for the extraction of canonical variates, using training data, that provided optimal linear directions of discrimination between underlying feature classes. If class labels are one of the data sets, then the method of canonical correlation (Hotelling, 1936) may be used to provide the Fisher discriminant functions. This is relevant to the case of supervised learning with an external teacher. However, as Kay (1992) showed, canonical correlation can be used in a neural network for the extraction of linear latent variables under contextual supervision. This suggests that discovery of predictive relationships between diverse data sets could be one goal of the cerebral cortex and that this goal can be formulated at the level of local circuits and can be usefully combined with the compressive recoding of the receptive field input data.

This article has four aims: (1) to show how local processors can use contextual input in a way that does not confound it with the information transmitted about the receptive field; (2) to show how finding predictive relations between data sets can be combined with the compressive recoding of the receptive field input data; (3) to apply gradient ascent to the formally specified
goals to derive learning rules for the receptive and contextual field weights; and (4) to describe simulations that test whether such a local processor has the basic properties required.

There is a strong tradition in studies of cortical computation that proposes the reduction of redundancy as a major organizing principle (Barlow, 1961; Atick & Redlich, 1993). This goal can be specified locally and can be formulated as that of maximizing the mutual information between the input and the output of local processors under certain constraints (e.g., Linsker, 1988). Because this goal is formulated for local processors without contextual inputs, however, it cannot take context into account. “In a complex network, or in an animal's brain, it is totally unclear how a component is to ‘decide’ what transformation its connections should perform. If a local optimization principle is to be used—one that does not take account of remote high level goals—then we do not know what information is going to be needed at high levels. Since we don't know what information we can afford to discard, it is reasonable to preserve as much information as possible within the imposed constraints” (Linsker, 1988, p. 116). Given this view, it is then necessary to assume that selection of the information that is relevant within specific contexts must occur at some later and quite distinct stage of processing. The possibility being studied here is that contextual inputs provide local processors with a way of distinguishing relevant from irrelevant information even at early stages of processing. This contextual information does not necessarily have to come from remote high-level goals, however, but could be a specific local context appropriate to the position of the local processor within the system as a whole.

Becker and Hinton (1992) have shown how maximizing the mutual information between the outputs of units that receive their inputs from diverse proximal data sets can provide internal supervision that enables local processors to discover distal variables that are only implicit in the input data. They formulate this goal in information-theoretic terms. Linsker (1988) also uses an information-theoretic approach, but his goal was to maximize the mutual information between the inputs and the outputs of each local processor, without taking any contextual information into account. Section 3 shows how these two goals may be seen as specific points within a large space of possible goals.

Throughout this article, it is assumed that the units of a given local processor are probabilistic, with their values being described by random variables. We denote the values of the m receptive field (RF) units and n contextual field (CF) units, respectively, by R1, R2, . . . , Rm and C1, C2, . . . , Cn, and we use the vector notation R and C. We make no assumptions regarding the probabilistic mechanism that produces the values of R and C; instead the empirical distributions of the input data are used, thus freeing us from rigid probabilistic modeling and allowing greater generality. It is assumed that the output of the local processor is represented by a binary random variable
X, with conditional output probability

    Pr(X = 1 | R = r, C = c) = 1 / (1 + exp(−A(s_r, s_c))),    (1.1)
where s_r = Σ_{i=1}^{m} w_i r_i − w_0 and s_c = Σ_{i=1}^{n} v_i c_i − v_0 denote, respectively, the integrated RF and CF input fields, the {w_i} and the {v_i} denote the weights on the connections between the output and the RF and CF inputs, respectively, w_0 and v_0 are the RF and CF biases, A denotes a function that specifies the internal activation, and we have taken the scale parameter of the logistic nonlinearity to be unity.

2 Activation Functions with Contextual Guidance

Becker and Hinton (1992) did not allow the contextual information to affect the short-term dynamics because they did not want the separate processors to maximize the mutual information in their outputs simply by driving each other. The use of context to affect learning without affecting ongoing processing is not only biologically implausible, however, but it also fails to use context to guide processing. We therefore require an activation function that possesses the following properties:

• If the integrated RF input is zero, then the activation should be zero.
• If the integrated CF input is zero, then the activation should be the integrated RF input.
• If the integrated RF and CF inputs agree, then the gain of the function relating activation to RF input should be increased.
• If the integrated RF and CF inputs disagree, then the gain of the function relating activation to RF input should be decreased.
• Only the integrated RF input should determine the sign of the activation, so that context cannot affect the direction of the output decision.

The following function possesses all of these properties,
1 sr (1 + exp(2sr sc )) 2
(2.1)
and was used in the experiments described below. It is one of a class of functions of the form sr (k1 + (1 − k1 ) exp(k2 sr sc )), where k1 and k2 are constants (0< k1 <1, k2 >0). This class of functions was motivated in part by the assumption that the effect of context should depend on the prevailing state of activation and was also derived mathematically from the above requirements (Kay, 1994). This class of functions does not uniquely encapsulate the above functional requirements but nevertheless is sufficient. The activation function is illustrated in Figure 1. The activation function (see equation 2.1) is not intended to translate directly into neurophysiology, and we do not know whether any translation is possible. We do know, however, that cortical pyramidal cells in general
Functions, Goals, and Rules for Local Processors
899
Output Probability 0 .2 .4 .6 .8 1
A
4 Int 2 eg ra 0 ted CF -2 Inp -4 ut
-4
2 t 0 Inpu -2 ed RF at r g e t
4
In
0.8 0.6 0.4
Output Probability
0.0
0.2
0.8 0.6 0.4 0.2 0.0
Output Probability
1.0
C
1.0
B
-2
-1
0 Integrated RF Input
1
2
-2
-1
0
1
2
Integrated CF Input
Figure 1: (A) The surface of output probability with respect to the integrated RF and CF inputs. It is apparent that when the RF and CF inputs agree, appropriate saturation takes place. The inbuilt asymmetry of the activation function is clear when one considers the quadrants where the RF and CF inputs disagree: when the RF value is positive and the CF value negative, the rate at which the probability reaches saturation as the RF value increases is depressed only slightly, whereas when the RF and CF values are reversed, the high positive CF value cannot drive the probability to saturation at 1. (B) The output probability plotted as a function of RF activity for five fixed values of the CF activity. These cross-sectional curves are constrained to pass through the point where the probability is 0.5 and the RF is zero, and they do not intersect. We see that the CF activity boosts the rate of saturation of the probability. (C) A cross-section through the probability surface for five fixed values of the RF activity. This panel is strikingly different from (B). It illustrates that the probability curves cannot pass through the 0.5 barrier and that appropriate saturation cannot be achieved by the CF activity alone.
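The required properties of the activation function can be checked directly. The following minimal Python sketch (our illustration; the function and variable names are ours, not from the article) evaluates equations 1.1 and 2.1 and verifies the five requirements listed above.

```python
import numpy as np

def activation(s_r, s_c):
    """Contextual activation of equation 2.1: A(s_r, s_c) = 0.5 * s_r * (1 + exp(2 s_r s_c))."""
    return 0.5 * s_r * (1.0 + np.exp(2.0 * s_r * s_c))

def output_probability(s_r, s_c):
    """Conditional output probability of equation 1.1 (logistic of the activation)."""
    return 1.0 / (1.0 + np.exp(-activation(s_r, s_c)))

# The five properties required of A in section 2:
assert activation(0.0, 3.0) == 0.0                    # zero RF input -> zero activation
assert activation(1.5, 0.0) == 1.5                    # zero CF input -> activation equals RF input
assert activation(1.0, 1.0) > activation(1.0, 0.0)    # agreement increases the gain
assert activation(1.0, -1.0) < activation(1.0, 0.0)   # disagreement decreases the gain
assert np.sign(activation(-1.0, 5.0)) == -1.0         # the RF input alone determines the sign
```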
We do know, however, that cortical pyramidal cells in general have voltage-dependent gain control channels. Furthermore, Hirsch and Gilbert (1991) have shown that the long-range horizontal collaterals in V1 have a voltage-dependent rather than a driving synaptic physiology and can learn (Hirsch & Gilbert, 1993). These collaterals join pyramidal cells with nonoverlapping RFs and are a primary example of what we call contextual inputs.

3 Information-Theoretic Objective Functions

To show how the use of contextual inputs extends the range of possible goals, we compare the computational goals proposed by Linsker (1988) and Becker and Hinton (1992). Linsker's Infomax goal is to maximize the mutual information I(X; R) between the RF inputs, R, and the output X. Becker and Hinton's explicitly stated goal was to maximize I(X; C), that is, the mutual information between the output of one channel and the contextual inputs from other channels. However, their implicit intention was to maximize the shared information between X, R, and C. To express this explicitly we introduce the concept of three-way mutual information, which is defined by

$$I(X; R; C) = I(X; R) - I(X; R \mid C) \tag{3.1}$$

and equivalent versions (see Figure 2). Note that in the general form of definition 3.1, X may also be a vector and that this definition can be extended to n-way mutual information (Kay, 1994). The various information components can be most easily seen with the aid of a diagram (see Figure 2). When contextual connections from neighboring channels exist, we may consider definition 3.1 in the form I(X; R) = I(X; R; C) + I(X; R|C), so that the information shared by the output and RF inputs may be decomposed into that shared also with the CF, that is, I(X; R; C), and that which is not shared with the CF, that is, I(X; R|C). Similarly, I(X; C|R) denotes the information that is transmitted about the contextual inputs that is not shared with the RF activity. The term H(X|R, C) denotes the conditional Shannon entropy in the output given RF and CF inputs and represents the output information shared with neither the RFs nor the CFs. It can be seen from Figure 2 that the information in the marginal output distribution may be decomposed as follows: H(X) = I(X; R; C) + I(X; R|C) + I(X; C|R) + H(X|R, C). We would normally wish to decrease I(X; C|R) and H(X|R, C), but we desire I(X; R; C) to grow and possibly also I(X; R|C). A class of objective functions that specifies these computational goals is given by the following:

$$F = I(X; R; C) + \phi_1 I(X; R \mid C) + \phi_2 I(X; C \mid R) + \phi_3 H(X \mid R, C). \tag{3.2}$$
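Since all quantities here are defined for discrete variables, the decomposition of equation 3.1 can be evaluated directly from an empirical joint distribution. The sketch below is ours, not the authors' code; the array layout and names are assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a probability array; zero entries are ignored."""
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a joint table with axes (x, y)."""
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy)

def three_way_info(p_xrc):
    """I(X;R;C) = I(X;R) - I(X;R|C) (equation 3.1), from a joint table with axes (x, r, c)."""
    p_c = p_xrc.sum(axis=(0, 1))
    i_xr_given_c = sum(p_c[c] * mutual_info(p_xrc[:, :, c] / p_c[c])
                       for c in range(p_xrc.shape[2]) if p_c[c] > 0)
    return mutual_info(p_xrc.sum(axis=2)) - i_xr_given_c

# Toy joint distribution over binary X, R, C (hypothetical numbers summing to 1):
p = np.random.default_rng(0).dirichlet(np.ones(8)).reshape(2, 2, 2)
print(three_way_info(p))
```

Note that, unlike two-way mutual information, the three-way quantity can be negative.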
Figure 2: The decomposition of the information in the output into four disjoint components. Each circle represents the total information in each component of the local processor. The inset shows the flow of information through the processor.
The parameters φ1, φ2, and φ3 express the importance of their respective information components relative to the three-way mutual information I(X; R; C). We now focus on the subclass of objective functions where φ2 = φ3 = 0. Taking φ1 = 1, and so giving the terms I(X; R; C) and I(X; R|C) equal importance, we obtain from equation 3.1 that F = I(X; R). This is Linsker's Infomax objective function; but note that in order to obtain a formal equivalence to his computational goal, the contextual connections would have to be set to zero.
Becker and Hinton intended to maximize I(X; R; C), which is achieved by taking φ1 = 0. Hence, φ1 is a critical parameter; allowing it to increase from 0 to 1 makes it possible to specify the relative importance to be afforded to maximizing information transmission within channels as compared with maximizing predictability across channels. Schmidhuber and Prelinger (1993) describe an algorithm for discovering predictable classifications. Taking φ1 = 1 − ε, φ2 = ε, and φ3 = 0 in our system gives the objective function F = εI(X; C) + (1 − ε)I(X; R), which is an information-theoretic analog of their objective function. Thus, a general class of objective functions has been introduced that subsumes the computational goals of Linsker (1988) and Becker and Hinton (1992) as special cases and allows hybrid possibilities.

4 Learning Rules

Learning rules for the modification of the RF and CF weights were derived using gradient ascent. A summary is presented in the appendix. The rules are

$$\frac{\partial F}{\partial \mathbf{w}} = \left\langle (\psi_3 A - \bar{O})\, p(1 - p)\, \frac{\partial A}{\partial s_r}\, \mathbf{R} \right\rangle_{R,C} \tag{4.1}$$

$$\frac{\partial F}{\partial \mathbf{v}} = \left\langle (\psi_3 A - \bar{O})\, p(1 - p)\, \frac{\partial A}{\partial s_c}\, \mathbf{C} \right\rangle_{R,C} \tag{4.2}$$

where

$$\bar{O} = \log \frac{E}{1 - E} - \psi_1 \log \frac{E_R}{1 - E_R} - \psi_2 \log \frac{E_C}{1 - E_C}.$$
In practice, empirical averages are taken over input patterns, which means that no explicit probabilistic modeling of inputs is required; nor is it necessary to learn explicitly the joint probability distribution of R and C. All that is required is to learn the expectations E, ER, and EC. For example, EC is the average output probability taken over all RF input patterns for which the contextual inputs are equal to C. The weight changes Δw and Δv are taken to be proportional to the partial derivatives in equations 4.1 and 4.2, respectively. These equations are written for batch learning; online learning is performed by removing the averaging brackets. It is important to stress that the required weight changes and also the expectations may be calculated recursively and updated after the presentation of each pattern, and so online learning may be achieved without making a double pass through the data within each epoch (Kay, 1994). Online learning was used successfully in the experiments described in section 5.
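The following Python sketch shows one way this recursive bookkeeping and online update could look. It is our reading of the procedure, not the authors' code: exponential running averages with a fixed rate stand in for the exact recursive averages, and all names are illustrative. The derivatives of equation 2.1 used here are stated explicitly just below.

```python
import numpy as np

def logit(q):
    """Log-odds; assumes 0 < q < 1."""
    return np.log(q / (1.0 - q))

class RunningAverages:
    """Recursive estimates of E, E_R, and E_C, updated after each pattern so that
    no second pass over the data is needed within an epoch."""
    def __init__(self, rate=0.01):
        self.rate, self.E = rate, 0.5
        self.E_R, self.E_C = {}, {}      # one running average per distinct RF / CF pattern

    def update(self, r_key, c_key, p):
        self.E += self.rate * (p - self.E)
        for table, key in ((self.E_R, r_key), (self.E_C, c_key)):
            old = table.get(key, 0.5)
            table[key] = old + self.rate * (p - old)
        return self.E, self.E_R[r_key], self.E_C[c_key]

def online_step(w, v, r, c, avgs, psi=(1.0, 1.0, -1.0), lr=0.05):
    """One online update of RF weights w and CF weights v (equations 4.1 and 4.2
    with the averaging brackets removed). psi holds (psi1, psi2, psi3); the
    default corresponds to maximizing the three-way mutual information
    (phi1 = phi2 = phi3 = 0). Biases are assumed folded into w and v via extra
    inputs clamped at -1, as in the appendix."""
    psi1, psi2, psi3 = psi
    s_r, s_c = w @ r, v @ c                      # integrated RF and CF fields
    e2 = np.exp(2.0 * s_r * s_c)
    A = 0.5 * s_r * (1.0 + e2)                   # equation 2.1
    p = 1.0 / (1.0 + np.exp(-A))                 # equation 1.1
    E, E_R, E_C = avgs.update(tuple(r), tuple(c), p)
    o_bar = logit(E) - psi1 * logit(E_R) - psi2 * logit(E_C)
    common = lr * (psi3 * A - o_bar) * p * (1.0 - p)
    w += common * (0.5 + (0.5 + s_r * s_c) * e2) * r   # dA/ds_r times R
    v += common * (s_r ** 2 * e2) * c                  # dA/ds_c times C
    return p
```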
The term Ō represents a dynamic average that can be influenced by the current CF inputs as well as by the current RF inputs. The term p(1 − p) provides intrinsic weight stabilization. The partial derivatives of the activation function (see equation 2.1) are given by

$$\frac{\partial A}{\partial s_r} = \frac{1}{2} + \left(\frac{1}{2} + s_r s_c\right) \exp(2 s_r s_c)$$

$$\frac{\partial A}{\partial s_c} = s_r^2 \exp(2 s_r s_c).$$

The weight change specified by these rules is nonmonotonically related to postsynaptic activity in a similar way to that proposed by Bienenstock, Cooper, and Munro (1982). It is also similar to the simpler form of nonmonotonicity found by Artola, Brocher, and Singer (1990) in studies of plasticity in slices of adult rat neocortex, which has been shown to have useful computational properties by Hancock, Smith, and Phillips (1991).

5 Experiments and Discussion

We now provide two different illustrations of the role of contextual guidance in the processing of receptive field data. The following experiments were conducted using a net with two channels, each having its own receptive field inputs and a single output unit, with contextual connections between the outputs. Hence, the output unit within each of the channels is an example of a local processor as defined in this article. In all of the experiments online learning was used; the initial weights were generated randomly from the uniform distribution on the interval [−0.01, 0.01]; the learning rate per input pattern was taken to be 0.5 divided by the number of input patterns; the input patterns were presented randomly to the network; the network was set to learning mode for 1000 epochs, by which time the objective function had stabilized in all of the experimental runs; and the means of the random variables representing the outputs of the local processors were communicated to each other in order to provide mutual contextual guidance.

5.1 Discovery of Latent Structure Using Contextual Guidance. The purpose of the experiments described in this subsection is fourfold: (1) to show that when the three-way mutual information is used as the objective function for each local processor, the net can "discover" the variable that is predictably related across channels, but is not the most informative variable within channels, even when the cross-channel correlation is weak; (2) to show that this variable is discovered more quickly when it is also the most informative variable within channels; (3) to demonstrate that the use of Infomax on the combined receptive field data (treated as a single data set) fails to discover the relevant variable; and (4) to demonstrate that a number of other obvious activation functions do not produce the desired solution in experiment 1.
Four distinct binary input patterns were generated comprising positive and negative horizontal and vertical bar patterns of size 5 × 5, the sign of the pattern being determined by the sign of the central horizontal row or vertical column of the input patterns, with all other entries of the patterns having the opposite sign. In experiment 1, a batch of 100 input patterns was used, consisting of an equal number (14) of positive and negative horizontal bar patterns and an equal number (36) of positive and negative vertical bar patterns, and these 100 patterns were presented randomly within both channels. The patterns were presented so that the horizontal bar pattern was present in both channels on 28 percent of occasions and its sign was perfectly correlated across channels. In the other 72 percent of presentations, the vertical bar pattern was presented to both channels, but its sign was uncorrelated across channels. When the objective function at each output was the three-way mutual information (φ1 = 0), the RF weights in both channels converged to the pattern displayed in Figure 3A (or its negative) and both processors signaled the sign of the horizontal bar pattern. That is, the output probabilities obtained when the positive and negative horizontal bars were presented to the network were 1 and 0 (or vice versa), respectively, while the vertical bar patterns produced values close to 0.5. On the other hand, when the objective function was set to Infomax (φ1 = 1 and the cross-channel contextual connections fixed at 0), the result was different. The RF weights in both channels converged to the pattern shown in Figure 3C (or its negative), but this time the sign of the input pattern was signaled by the processors without any distinction as to whether the horizontal or vertical bar pattern was present. That is, the output probability obtained when the positive horizontal and vertical bar patterns were presented was 1, and it was 0 for the negative patterns (or vice versa). Thus, we conclude in the Infomax case that the four distinct input patterns are clustered into two groups defined solely by the sign of their bar pattern, but the presence of a horizontal or vertical bar pattern cannot be distinguished. On the other hand, when the three-way mutual information is the objective function, the sign of the horizontal bar is signaled by each processor; this is the variable correlated (weakly) across channels and indicates the relevance of the horizontal bar pattern, as opposed to the vertical one, with respect to the current context. This demonstration illustrates the critical role played by the φ1 parameter in the objective function and the role of contextual guidance in selecting and signaling the relevant variable in the receptive field data. The patterns shown in Figure 3 are very close to those obtained by performing a traditional principal component analysis on the data and are, approximately, the second (A) and first (C) principal components, respectively.
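For concreteness, the bar stimuli and the cross-channel presentation statistics described above can be generated as follows. This is a sketch under our reading of the description; the names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def bar_pattern(orientation, sign, size=5):
    """A size x size bar pattern: the central row (horizontal) or central column
    (vertical) takes the value `sign`; all other entries take the opposite sign."""
    p = np.full((size, size), -float(sign))
    if orientation == "horizontal":
        p[size // 2, :] = sign
    else:
        p[:, size // 2] = sign
    return p.ravel()

def sample_channel_pair():
    """One presentation for experiment 1: a horizontal bar in both channels with
    perfectly correlated signs on 28 percent of occasions; otherwise vertical
    bars with signs chosen independently in each channel."""
    if rng.random() < 0.28:
        s = rng.choice([-1, 1])
        return bar_pattern("horizontal", s), bar_pattern("horizontal", s)
    return (bar_pattern("vertical", rng.choice([-1, 1])),
            bar_pattern("vertical", rng.choice([-1, 1])))
```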
Figure 3: (A) The converged RF weights when the sign of the horizontal bar pattern is correlated across channels but not the most informative variable within channels, and the computational goal was to maximize the three-way mutual information. (C) The converged RF weights when the computational goal was Infomax. (B, D) The signs of the weights in (A) and (C), respectively. The converged weights when the RF and CF inputs were concatenated and presented in a single channel, and the objective was Infomax, are displayed in (E) and their signs in (F).
In experiment 2, the relative frequencies of presentation of the horizontal and vertical patterns were reversed, with a horizontal and a vertical bar pattern present on 72 percent and 28 percent of occasions, respectively. In this case, the sign of the horizontal bar is the most informative variable within channels as well as being correlated across channels. In both cases, when the three-way mutual information and Infomax were employed as objective functions, the RF weights converged to a pattern of the form in Figure 3A; when the three-way mutual information was used, the speed of learning was greater than in the previous experiment, as is illustrated in Figure 4. We divided the input data into two distinct channels, which we termed the RF and CF inputs.
Figure 4: The three-way mutual information during the course of learning in the experiments involving the horizontal and vertical bar patterns. The upper two graphs show the objective function for both local processors in the case where the sign of the horizontal bar is correlated across channels and also the most informative variable within channels. The lower graphs were obtained in the case where the sign of the horizontal bar was correlated across channels but not the most informative variable within channels.
In experiment 3, we reconsider the first experiment but present the combined RF and CF input data within a single channel and attempt to discover the sign of the horizontal bar pattern using Infomax as the objective function. Figure 3E shows a typical example of the stabilized RF weights in this case. There are six input patterns. The output probability was 1 when the 5 × 10 pattern was a positive horizontal bar pattern, a concatenation of two positive vertical bar patterns, or a concatenation of a negative with a positive vertical bar pattern, and it was 0 for the other three inputs; thus, the stabilized net fails to signal unambiguously the sign of the horizontal bar. Of course, this is hardly surprising. After all, there is nothing in the Infomax goal to signify the additional information that the sign of the horizontal bar pattern is the relevant variable; that is where contextual guidance is required. This simple experiment demonstrates the value of separating the input data into two separate channels.
Finally, in experiment 4, experiment 1 was conducted using different separable activation functions: $s_r + s_c$, $s_r s_c$, $s_r + s_r s_c$, and $s_r \exp(s_c)$. With the exception of $s_r s_c$, with which no learning occurred, the use of these activation functions failed to produce the desired result: the signaling of the sign of the horizontal bar pattern. An Infomax-type solution was obtained, with the output probability being 1 for both positive patterns and 0 for the negative ones (or vice versa). This demonstrates that a nonseparable activation function is required and that the type defined in equation 2.1 is sufficient.

5.2 Contextual Guidance from a Stochastic Supervisor. The purpose of this experiment is to discover whether the provision of incomplete information about the "true" class of an input pattern, in the form of contextual guidance, improves on the performance of unsupervised classification. Real data were used from a study of a rare hypertensive syndrome (Conn's syndrome), which can be due to either a benign tumor in the adrenal cortex (type A) or bilateral hyperplasia of the adrenal glands (type B) (Brown, Fraser, Lever, & Robertson, 1971). Data are available on 31 patients comprising age, blood pressures, and five blood plasma measurements (Aitchison & Dunsmore, 1975), thus providing eight-dimensional input patterns; 20 of the patients are known, postsurgery, to have type A and the remaining 11 type B. The data for each variable were recoded as +1 for an observation above the mean value and −1 otherwise. Real examples raise an important computational issue for the approach defined in this article: it is required to store a conditional average ER for each distinct input pattern. Clearly, with large data sets, this could produce a serious computational overload. However, it transpires that this is not a problem at all; there is an exact simplification of the mathematics in this case. Each distinct RF input pattern corresponds to only one CF pattern, and hence the conditional average ER becomes the current output probability, and so this potential storage problem disappears (Kay, 1994). It is not even necessary to rewrite this case as a special form of the learning rules, as the recursive, online nature of the algorithm takes care of it automatically. In the experiment, the RF data were the 31 eight-dimensional binary patterns. The CF data were generated as follows. The true class types were represented as (0, 1) or (1, 0) patterns. These were linearly transformed to (−1, 1) and (1, −1), respectively. Then each of these patterns was multiplied by a random number generated from the uniform distribution on the interval [0.2, 0.8]. This corresponds to the assumption that the probability for the correct class varies randomly and uniformly between 0.6 and 0.9. These patterns provided incomplete information as to the true class of each RF input pattern, and these CF inputs provided stochastic supervision. When the computational goal was Coherent Infomax, the architecture employed consisted of eight RF inputs and two CF inputs in two different channels. Each channel had an output unit, and these provided each other with cross-channel contextual guidance.
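The stochastic supervision inputs can be sketched as follows. This is our illustration; reading the scaled value as 2p − 1 is our interpretation, chosen because it reproduces the stated range [0.6, 0.9].

```python
import numpy as np

def stochastic_supervisor(true_class, rng):
    """A CF pattern carrying incomplete class information, following the recipe
    above: the class is coded as (-1, 1) for type A and (1, -1) for type B, then
    scaled by u drawn uniformly from [0.2, 0.8]. Reading the scaled value as
    2p - 1, the implied probability of the correct class varies over [0.6, 0.9].
    """
    base = np.array([-1.0, 1.0]) if true_class == "A" else np.array([1.0, -1.0])
    return rng.uniform(0.2, 0.8) * base
```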
When the computational goal was Infomax, the stochastic units were combined with the data within a single RF in order to provide a fairer comparison with Coherent Infomax. The network was run with the objective function set to Infomax (φ1 = 1) and to the three-way mutual information (φ1 = 0), but in each case the weights were initialized from the same random numbers. This procedure was performed five times, each time using a different seed of the random number generator. The results were as follows. In Infomax mode (completely unsupervised learning) the output probabilities of the patterns saturated and so divided the inputs into two groups. Comparing these results with the knowledge about the true classes, the number of misclassification errors was 7, 7, 8, 8, 8 on the five runs, and the average misclassification rate was 24.5 percent. On the other hand, when contextual guidance was applied, using the three-way mutual information as the objective function, the same five patterns were misclassified on all runs, giving an average misclassification rate of 16.1 percent. Hence this experiment demonstrates that the provision of additional incomplete information about the appropriate classification of the RF input patterns, in the form of stochastic supervision via contextual guidance, can provide a more appropriate grouping of the RF input patterns. Furthermore, this may be achieved in practice using the methodology introduced in this article.

5.3 Conclusions. These studies show that local processors can discover functions of the RF inputs that are predictably related to the context in which they occur and can use those predictions to guide processing without confusing them with the information transmitted about the RF. They also demonstrate that the methodology introduced here possesses the required computational capabilities. The methodology has recently been extended to deal with the incorporation of multiple layers and local processors having multiple output units. The capabilities of these more complex networks are currently under investigation and will be reported.

Acknowledgments

The contribution of W. A. Phillips to this work was funded in part by a Human Capital and Mobility Network grant from the European Community, contract number CHRX-CT93-0097.

Appendix

For further details of these derivations, see Kay (1994). It is convenient to rewrite the objective function (see equation 3.2) in the following form,

$$F = H(X) - \psi_1 H(X \mid R) - \psi_2 H(X \mid C) - \psi_3 H(X \mid R, C), \tag{A.1}$$
where ψ1 = 1 − φ2, ψ2 = 1 − φ1, and ψ3 = φ1 + φ2 − φ3 − 1. (These relations follow from equation 3.2 on writing I(X; R; C) = I(X; R) − I(X; R|C), I(X; R) = H(X) − H(X|R), I(X; R|C) = H(X|C) − H(X|R, C), and I(X; C|R) = H(X|R) − H(X|R, C), and collecting the entropy terms.) Biases are accommodated by the usual practice of introducing additional inputs clamped at −1. We require the partial derivatives of F taken with respect to w and v. Recall from equation 1.1 that the output unit is bipolar and that the conditional probability that the output is +1 is given by a logistic nonlinearity. It follows that the conditional output entropy is given by

$$H(X \mid R, C) = -\langle p \log p + (1 - p)\log(1 - p) \rangle_{R,C}, \tag{A.2}$$
where the output probability $p = 1/(1 + \exp(-A(s_r, s_c)))$ and the activation function A is that defined in equation 2.1. The notation $\langle \cdots \rangle_{R,C}$ denotes the operation of taking the average with respect to the joint distribution of R and C. It follows that

$$\frac{\partial H(X \mid R, C)}{\partial \mathbf{w}} = -\left\langle \left(\log \frac{p}{1 - p}\right) p(1 - p)\, \frac{\partial A}{\partial s_r}\, \mathbf{R} \right\rangle_{R,C} \tag{A.3}$$

$$\frac{\partial H(X \mid R, C)}{\partial \mathbf{v}} = -\left\langle \left(\log \frac{p}{1 - p}\right) p(1 - p)\, \frac{\partial A}{\partial s_c}\, \mathbf{C} \right\rangle_{R,C} \tag{A.4}$$
In order to deal with the other entropy terms in equation A.1, we require expressions for the marginal probability that X = 1 as well as the conditional probabilities Pr(X = 1 | R = r) and Pr(X = 1 | C = c). Given the simplicity of the binary output case, these are the averages of p taken, respectively, over the joint distribution of R and C, the conditional distribution of C given that R = r, and the conditional distribution of R given that C = c. These terms are denoted, respectively, by E, ER, and EC. Now using result A.2, applying differentiation to yield results similar to equations A.3 and A.4, and collecting terms gives equations 4.1 and 4.2.

References

Aitchison, J., & Dunsmore, I. R. (1975). Statistical prediction analysis. Cambridge: Cambridge University Press.
Artola, A., Brocher, S., & Singer, W. (1990). Different voltage-dependent thresholds for the induction of long-term depression and long-term potentiation in slices of the rat visual cortex. Nature (London), 347, 69–72.
Atick, J. J., & Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Comp., 5, 45–60.
Barlow, H. B. (1961). Possible principles underlying the transformations of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Becker, S., & Hinton, G. E. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London), 355, 161–163.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuronal selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Brown, J. J., Fraser, R., Lever, A. F., & Robertson, J. I. S. (1971). Abstracts of World Medicine, 45, 549–644.
Eckhorn, R., Reitboeck, H. J., Arndt, M., & Dicke, P. (1990). Feature linking among distributed assemblies: Simulations and results from cat visual cortex. Neural Comp., 2, 293–306.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends in Neuroscience, 15, 218–226.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Hancock, P. J. B., Smith, L. S., & Phillips, W. A. (1991). Neural Comp., 3, 201–212.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11, 1800–1809.
Hirsch, J. A., & Gilbert, C. D. (1993). Long-term changes in synaptic strength along specific intrinsic pathways in the cat visual cortex. J. Physiol., 461, 247–262.
Hotelling, H. (1936). Relation between two sets of variates. Biometrika, 28, 321–377.
Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psych. Rev., 99, 480–517.
Kay, J. (1992). Feature discovery under contextual supervision using mutual information. In Proceedings of the 1992 International Joint Conference on Neural Networks (Baltimore) (Vol. 4, pp. 79–84).
Kay, J. (1994). Information-theoretic neural networks for the contextual guidance of learning and processing: Mathematical and statistical considerations (Tech. Rep.). Aberdeen, UK: Biomathematics and Statistics Scotland, Macaulay Land Use Research Institute.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–117.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
Schmidhuber, J., & Prelinger, D. (1993). Discovering predictable classifications. Neural Comp., 5, 625–635.
Singer, W. (1990). Search for coherence: A basic principle of cortical self-organization. Concepts in Neuroscience, 1, 1–26.
Received April 21, 1994; accepted August 2, 1996.
Communicated by Peter Dayan
Marr's Theory of the Neocortex as a Self-Organizing Neural Network

David Willshaw
Centre for Cognitive Science, University of Edinburgh, Scotland, UK
John Hallam
Department of Artificial Intelligence, University of Edinburgh, Scotland, UK

Sarah Gingell
Centre for Cognitive Science, University of Edinburgh, Edinburgh, Scotland, UK

Soo Leng Lau
Department of Artificial Intelligence, University of Edinburgh, Edinburgh, Scotland, UK
Marr’s proposal for the functioning of the neocortex (Marr, 1970) is the least known of his various theories for specific neural circuitries. He suggested that the neocortex learns by self-organization to extract the structure from the patterns of activity incident upon it. He proposed a feedforward neural network in which the connections to the output cells (identified with the pyramidal cells of the neocortex) are modified by a mechanism of competitive learning. It was intended that each output cell comes to be selective for the input patterns from a different class and is able to respond to new patterns from the same class that have not been seen before. The learning rule that Marr proposed was underspecified, but a logical extension of the basic idea results in a synaptic learning rule in which the total amount of synaptic strength of the connections from each input (“presynaptic”) cell is kept at a constant level. In contrast, conventional competitive learning involves rules of the “postsynaptic” type. The network learns by exploiting the structure that Marr assumed to exist within the ensemble of input patterns. For this case, analysis is possible that extends that carried out by Marr, which was restricted to the binary classification task. This analysis is presented here, together with results from computer simulations of different types of competitive learning mechanisms. The presynaptic mechanism is best known in the computational neuroscience literature. In neural network applications, it may be a more suitable mechanism of competitive learning than those normally considered.
Neural Computation 9, 911–936 (1997)
© 1997 Massachusetts Institute of Technology
1 Introduction

In the early 1970s, two interlinked theories for the role of the neocortex and the hippocampus in memory storage and retrieval were proposed by Marr (1970, 1971). He suggested that the hippocampus is the site for temporary storage of the raw information received by the brain, over a time scale of typically one day. Periodically this information is transferred to the neocortex, where it undergoes refinement before being stored there in a permanent form. As well as being among the first theories linking the function of these two structures with their underlying anatomy, Marr's ideas contain several proposals for training neural networks that remain relevant. However, most of his computational neuroscience work has not been in the mainstream of either brain theory or the theory of neural network devices. Previous work (Willshaw & Buckingham, 1990) explicated and extended his theory of the hippocampus (Marr, 1971). In this article we examine and extend his theory of the neocortex, which was concerned with how the cells in the neocortex learn to self-organize their responses to the structured information presented to it (Marr, 1970). Marr proposed that the pyramidal cells of the neocortex extract the structure contained in the patterns of activity impinging on the sense organs by being trained to identify the natural groupings, or classes, into which these patterns fall. For this task, he proposed a feedforward neural network with three layers of cells: input cells, which project to codon cells, which in turn project to an output layer of pyramidal cells. The codon cells are intended to extract the salient features from the input patterns, each active codon cell signaling the presence of a particular feature in an input pattern. Training the network involves presenting a succession of input patterns. At each presentation, the strengths of the connections between codon cells and output cells are modified according to a learning rule that enables the output cell responses to self-organize. Ultimately each output cell comes to be identified with one of the classes of input patterns. Once training is complete, presentation of a new input pattern (which may not have been seen before) from one of the classes causes the output cell that has become identified with that class to respond. How the codon cells are wired up to the input cells so that each responds to the presence of a different feature in an input pattern depends on the precise structure of the input information. Marr assumed that the input-codon wiring is carried out during a pretraining phase. Once the wiring is set up, it remains fixed, and it is the contacts between codon cells and output cells that are modified during training. The codon layer is found to be essential for several reasons. The number of possible input patterns is assumed to be so large that it is unlikely that any input pattern will be presented more than once. The codon representation allows measures of similarity between patterns to be calculated that can then be used to assign a
new, unknown pattern to the correct class. Furthermore, the codon layer can be used to infer the presence of features that are missing when incomplete patterns are presented. In addition, the recoding performed by the codon cells makes the individual classes of input patterns more distinguishable than without a recoding. Marr proposed a self-organizing synaptic learning rule, which causes the weight of each synapse between a codon cell and an output cell to come to record the probability that if the pattern presented contains the feature associated with the codon cell, then the pattern is of the class represented by the output cell. Thereafter, on presentation of an input pattern, the mean strength of the synapses measured over the active codon cells is the best estimate (in information theory terms) of the probability that the pattern belongs to the class identified with the output cell. The value of this quantity determines whether the output cell should fire.

1.1 Scope of This Article. Marr addressed separately the problems of how to set the fixed input-to-codon connections and the variable codon-to-output connections. Here we present our analysis of the second problem. We first show that for the sole case that he considered, that of a single output cell learning to make a binary classification, his proposed scheme has several shortcomings. His specification of the learning rule is unsatisfactory, and some of the neural wiring that he introduces to justify his scheme is both unjustifiable and unnecessary. We describe our proposed solutions to these difficulties. We then describe our extension of the basic scheme, to deal with more complicated classification tasks and to the case of competitive learning among a number of output cells. Preliminary results are contained in Lau (1992) and Gingell (1994). The problem of codon formation will be dealt with in a subsequent publication.

2 Formalism

Marr supposed that there is a potentially infinite number of input patterns, or events, which are grouped into a number of different classes. He used as an illustration of a class the set of events that represent various types of poodle. Each single event that represents a different occurrence of a poodle contains certain, but not all, of the features of the poodle class. The assignment of features to events in the same class is highly structured. Two events that represent different poodles have many features in common, many more than the number of features shared by an event representing a poodle with one that does not. Marr referred to such clusters of events over the set of input fibers as mountains. The simplest problem that can be considered is where the network has a single output cell and it has to learn to distinguish between the inputs from two mountains, occurring equiprobably. The result of successful training would be that the output cell would respond to all the members from one mountain and to none from the other. Marr called
this the spatial recognizer effect because the success of the training regime depends on the fact that, within the space of all events, events in the same class are, on average, closer together (i.e., share more features) than events from different classes. In order to be able to attack the spatial recognizer effect on its own, Marr assumed an idealized input-codon transformation, and he supposed that the input patterns define the following distribution of activity in the codon layer, containing N cells (see Figure 1). Each input event is defined by activity in a selection of K out of M cells in the codon layer. K is supposed to be very much smaller than M so that the number of possible input patterns, $\binom{M}{K}$, is very large. There are two mountains, and they are distinguishable from one another because the K active codon cells for an event from mountain 1 are chosen from the first M cells in the codon layer, labeled from 1 to M, and the K cells of mountain 2 are chosen from the last M cells, labeled from N − M + 1 to N. M is greater than N − M + 1, and so some codon cells are in both mountains. This was intended to be a simple way of defining structure within potentially large ensembles of events.

3 The Spatial Recognizer Effect

Designating the binary-valued activity of codon cell i as Ci and the strength of its synapse with the output cell as Wi, the output cell is supposed to fire if the sum $(1/K)\sum_i C_i W_i$ exceeds some threshold value, Θ. The problem is to find values for the weights and the threshold such that the output cell responds to events from one mountain only. If each event belongs to one mountain only, the problem is soluble in principle because the two ensembles of events are linearly separable. However, the solution is not straightforward, for two reasons: each event that occurs may never reoccur, and so it is difficult to collect statistics about which event belongs to which class; and in pattern classification terms, these data are supplied unlabeled, in that information about the identity of the mountain to which each event belongs is not available explicitly. This prevents standard supervised training algorithms, such as the Perceptron (Minsky & Papert, 1969), from being used.

3.1 Collecting the Evidence. If the classes into which the events fall can be identified, Marr points out that a common method for assigning a newly presented event E to one of the classes Ωj is to choose the Ωj that maximizes the probability P(E|Ωj) of E belonging to one of the classes {Ωj}. This is the maximum likelihood solution (Kingman & Taylor, 1966), and it works best when the Ωj can be regarded as random variables and the conditional probabilities P(E|Ωj) are known. However, here an individual event E may occur only once, and so the appropriate values of P(E|Ωj) may not be available.
Figure 1: How N codon cells are assigned to mountains, each containing M cells. (A) Two-mountain case. (B) Four-mountain case with cyclical boundary conditions.
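Anticipating the general circular construction detailed in section 5, the assignment of codon cells to mountains and the sampling of events can be sketched as follows. This is our code; the names and example numbers are illustrative.

```python
import numpy as np

def mountain_codons(q, N, M, Q):
    """Indices of the M codon cells for mountain q (0-based) on a circle of N
    cells: a block of M consecutive cells starting at q * N / Q, with cyclical
    boundary conditions as in Figure 1B."""
    return (q * N // Q + np.arange(M)) % N

def sample_event(q, N, M, Q, K, rng):
    """A binary codon vector for one event of mountain q: K of its M cells active."""
    C = np.zeros(N)
    C[rng.choice(mountain_codons(q, N, M, Q), size=K, replace=False)] = 1.0
    return C

# Example with illustrative numbers (those used in the simulations of section 5):
rng = np.random.default_rng(0)
event = sample_event(0, N=90, M=54, Q=5, K=40, rng=rng)
```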
Because of the structure in the data, the probability that a hitherto unknown event E belongs to a particular class will depend on its similarity with other events that belong to that class. This requires a computation that is different from that used in maximum likelihood estimation. Similarity between events is defined in terms of the number of features that they have in common. Marr suggested that the purpose of training the network is to record the values of P(Ωj |Ci = 1), the probability that the event that contains feature i is of class j, in the weight between the codon cell identified with feature i and the output cell identified with class j. When a new event E is presented for classification, certain codon cells become active, giving K estimates P(Ω|Ci = 1) (written as P(Ω|Ci)) that the event is of type Ω. In Marr's terminology, this is the evidence that the event is of type Ω, and the problem is how to combine these K estimates. He uses an information theory argument to show that using any combination other than their arithmetic mean increases the information inefficiency (Kullback-Leibler distance) between the true and estimated probability distributions (Sibson, 1969). Therefore, the best estimate of the probability that the presented event E is of class Ω is

$$\sum_{i=1}^{N} C_i P(\Omega \mid C_i) \Bigg/ \sum_{i=1}^{N} C_i,$$

where $\sum_{i=1}^{N} C_i$ is the number of estimates to be combined. Another way of looking at this is that each estimate P(Ω|Ci) records the probability that an event, drawn from the set of events that activates codon cell i, belongs to Ω. If an event E is in the set of events that activates exactly those codons concurrently, the best estimate of the probability of that event being in Ω is the minimal entropy combination of the various estimates of P(Ω|Ci). The expression Marr derived thus represents the probability that an event from the set that jointly activates the codons concerned is in Ω. He writes this probability, perhaps confusingly, as P(Ω|E). In effect, Marr is attempting to estimate the probability P(Ω|C1, C2, C3, . . .) from the individual probabilities P(Ω|C1), P(Ω|C2), P(Ω|C3), . . . . This is an underconstrained problem, and his use of information theory arguments could be regarded as a way of imposing extra constraints. Other possible formulations of this estimation problem are discussed in section 8.

3.2 Labeling the Data. These calculations can be carried out only if output cells have been assigned to classes so that it is known which output cell is to be activated by any given input pattern. Marr proposed that this output cell selection problem can be solved by the anatomy of the cortex; for example, for the two-mountain case, one of the codon cells that is active for patterns from just one of the two mountains should be able to drive the output cell directly (in a similar way to a climbing fiber driving a cerebellar Purkinje cell; Marr, 1969). In this way, the output cell would be biased from the start to respond to a proportion K/M of the events from this mountain.
Marr speculated that the Martinotti cells of the cerebral cortex carry out this function.

3.3 Making the Calculation. If the weight of synapse i holds the value P(Ω|Ci), a measure of P(Ω|E) can be obtained by summing up the weights on the active codon cells and dividing by the number of active input cells, K. The output cell is then placed in the active state if P(Ω|E) exceeds some threshold value Θ, which represents the minimum convincing evidence for E being in Ω; or when

$$\frac{1}{K}\sum_{i=1}^{N} P(\Omega \mid C_i) > \Theta.$$
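In code, the decision rule is simply the following (a sketch; the names are ours):

```python
import numpy as np

def fires(W, C, theta):
    """Marr's decision rule: the output cell fires when the mean synaptic weight
    over the active codon cells exceeds the threshold theta."""
    return W[C > 0].mean() > theta
```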
Given that output cells have been assigned to classes, to realize this procedure it is required to (1) store the conditional probabilities P(Ω|Ci) in the weights of the codon-output cell synapses, (2) calculate the number, K, of active codon cells, and (3) set the threshold on the output cell. To store conditional probabilities, it is required to count the number of occasions, N1, on which codon cell i is active and the number of occasions, N2, on which it and the output cell representing the particular class are jointly active so that the ratio N2/N1 can be formed. Marr suggested (without giving a method) that these numbers can be computed incrementally. He suggested that the threshold should be set so that the output cell fires the correct proportion of times. For the two-mountain case, he showed analytically that the use of these procedures results in the output cell learning to fire in response to all the events belonging to the mountain specified by the climbing fiber cell and to none of the events from the other mountain.

4 Preliminary Investigations of Marr's Scheme

4.1 Batch/Serial Update, Climbing Fiber Present/Absent. In preliminary studies, we looked at two possibilities for updating the values of N1 and N2: after presentation of each pattern (serial update) or after a specified number of events had been presented for classification (batch update). Two output selection strategies were examined. Following Marr, we looked at the case where one of the codon cells acted as a climbing fiber. Initially this codon cell had a contact of unit strength to the output cell, and the values of all other synapses were randomly chosen (climbing fiber present). We compared this with climbing fiber absent, where the initial values of all synapses were chosen at random, implying that the mountain to be identified with the output cell would also be chosen at random. This defines four different learning procedures. The threshold was set so that the output cell fired with the correct frequency. It was set either directly (for batch update) or by using a rule for adjusting the threshold incrementally after each pattern presentation (for serial update).
We carried out simulation runs for the binary classification task. Of the four different cases, the combination of batch update, climbing fiber present was best and serial update, climbing fiber absent was worst. This is unfortunate because it is difficult to imagine that patterns will be presented in batches to the neocortex and that the numbers N1 and N2 are counted separately during training, the ratio N2/N1 then being calculated once training is complete. In addition, climbing fiber present is possible only when the codon cell driving the climbing fiber is unique to just one mountain, which in general cannot be so. We now show how a computationally efficient version of the more biologically plausible method of serial update, climbing fiber absent can be made to function.

4.2 Serial Update, Climbing Fiber Absent. We developed a method for calculating conditional probabilities incrementally that was suggested by Minsky and Papert (1969). They showed that by employing this scheme, the value of the weight between an input cell and an output cell converges to the proportion of times that the output cell also fires when the input cell fires, which is the measure of probability required. Let the codon cell i and the single output cell have binary-valued activities Ci and O, respectively. The rules for changing the weights and the threshold are

$$\Delta W_i = \alpha(-W_i + W_{\max} O) C_i, \tag{4.1}$$

$$\Delta\Theta = \alpha(-\Theta + \Theta_{\max} O), \tag{4.2}$$

where α is a learning constant and Wmax, Θmax set the maximum values of Wi, Θ. The threshold Θ is treated as a weight from an input unit that is always active. The weight update rule has similarities with other schemes (Rumelhart & Zipser, 1985; Grossberg, 1976), as discussed in section 7. The values of α, Wmax, and Θmax have to be specified. Since the weights are intended to record conditional probabilities, it would seem logical to set the value of Wmax to 1. The value of α determines the speed of the process and can be set empirically. Preliminary experiments (Gingell, 1994) showed that using these rules for updating the weights and the threshold, the two-mountain problem can be learned easily by the system. However, performance depended on the value of Θmax. Setting it at a low value resulted in the output cell responding to either mountain; with a high value of Θmax, it learned to respond to neither. In order to investigate this situation more thoroughly and to provide analysis of the more general case of many mountains, we developed the following formalism.
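Before turning to that analysis, the serial-update rule itself can be sketched in code (ours; the parameter values are illustrative):

```python
import numpy as np

def train_step(W, theta, C, alpha=0.01, w_max=1.0, theta_max=1.6):
    """One serial update (equations 4.1 and 4.2), climbing fiber absent.

    C is the binary codon-activity vector. Over many presentations, W_i comes to
    record the proportion of occasions on which the output cell fires when codon
    cell i fires (Minsky & Papert, 1969). theta_max = 1.6 is illustrative, inside
    the single-mountain range identified in section 5.
    """
    K = C.sum()
    O = 1.0 if (W @ C) / K > theta else 0.0       # firing decision
    W += alpha * (-W + w_max * O) * C             # equation 4.1
    theta += alpha * (-theta + theta_max * O)     # equation 4.2
    return W, theta, O
```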
5 Analysis of the General Case

There are N codon cells and Q mountains. The events belonging to any particular mountain can cause activity in any of M particular codon cells with equal probability. Within a mountain, the events occur with equal probability. To determine which codon cells are identified with which mountains, imagine that the N codon cells are arranged around a circle. Mountain 1 is identified with codon cells 1, 2, 3, . . . , M, mountain 2 with cells 1 + N/Q, 2 + N/Q, 3 + N/Q, . . . , M + N/Q, and so on. In this way the codon cells for each mountain are arranged regularly around the circle, the block of codon cells for one mountain being shifted along the circle by N/Q codon cells to give the block for the next mountain (see Figure 1). Since each of the Q mountains is identified with M codon cells, a single codon cell will be active for the events from R different mountains, where

$$R = MQ/N. \tag{5.1}$$

The other important parameter is K, the number of codon cells active per input event, which is assumed constant. It is useful to introduce the parameter ε by the relationship

$$K = \frac{(R - 1)N}{Q}(1 + \varepsilon). \tag{5.2}$$

ε is positive and measures the amount by which the parameter K exceeds the minimum value it must have in order for each input pattern to be unique to a mountain. By investigating the conditions under which the events from certain mountains will cause the output cell to fire and those from other mountains prevent it from firing, we show how the value of Θmax determines the number of mountains that activate the output cell.

5.1 A Single Mountain. The analysis establishes the conditions under which events from a certain number of different mountains can activate the output cell once each weight value has come to record the proportion of occasions on which the output cell fires when the appropriate codon cell fires (Minsky & Papert, 1969). Since each codon cell is identified with R mountains occurring equiprobably, if a single mountain causes the output cell to fire, each weight from codon cells in that mountain will have value 1/R, and all other weights will have value 0. Similarly, if only one of the Q mountains activates the output cell, the threshold will have a value Θmax/Q. Therefore, for events from the chosen mountain to fire the output cell,

$$\frac{1}{R} > \frac{\Theta_{\max}}{Q}.$$
Of the other mountains, the one most likely to fire the output cell is the one that shares most codon cells with the given mountain. To prevent events from this other mountain from firing the output cell,

$$(R - 1)\frac{N}{Q}\,\frac{1}{KR} + \left(1 - (R - 1)\frac{N}{QK}\right) \cdot 0 \le \frac{\Theta_{\max}}{Q},$$

$$K\Theta_{\max} \ge \left(\frac{R - 1}{R}\right) N.$$

Putting these two inequalities together and expressing K in terms of ε yields the result

$$\frac{Q}{R(1 + \varepsilon)} \le \Theta_{\max} < \frac{Q}{R}. \tag{5.3}$$
In other words, for small ε, setting Θmax at slightly less than the ratio Q/R will ensure that the events from a single mountain, and only those events, activate the output cell.

5.2 Limits on Parameter Values. The number of mountains, R, in which one codon cell is involved must necessarily be less than the number of mountains to be distinguished, Q. As the value at which the free parameter Θmax must be set varies with 1/R (see equation 5.3), for large R the value of Θmax that enables just one mountain to activate the output cell will not differ much from the value enabling two mountains to excite the output cell (see section 5.3). In any real system, there will be a limit on the maximum value of R (and thus the maximum value of Q), which will depend on the precision to which Θmax can be set. The value of K, the number of codons active per input pattern, must be less than M, the number of codon cells per mountain. It must also exceed a lower limit, to ensure that each event belongs to just one mountain. K can vary within a range of N/Q since

$$M - \frac{N}{Q} < K < M.$$

Again, for large Q (and R) there is limited freedom in the choice of K. Figure 2 shows the results of some illustrative computer simulations for 90 codon cells and Q = 5 mountains. In both training and testing phases, the patterns were chosen from the $Q\binom{M}{K}$ possible patterns. Each run was repeated 100 times from the same initial sets of weights but with different training and testing sequences. The value of Θmax was varied over the range of values for which it is predicted that just a few mountains can excite the output cell. The range over which one mountain can excite the output cell is between 1.55 and 1.65, which is consistent with that predicted from the theory.
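As an arithmetic check (our addition), equation 5.3 can be evaluated for the parameters used in Figure 2; the observed single-mountain range of 1.55 to 1.65 falls inside the predicted window.

```python
N, Q, R, K = 90, 5, 3, 40                 # parameters of Figure 2
M = R * N // Q                            # 54 codon cells per mountain, from R = MQ/N
eps = K * Q / ((R - 1) * N) - 1.0         # epsilon from equation 5.2; here 1/9
lower = Q / (R * (1.0 + eps))             # 1.5
upper = Q / R                             # about 1.667
print(lower, upper)                       # the observed range 1.55-1.65 lies within
```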
Figure 2: A single-output network that was trained on five mountains. N = 90, R = 3, Q = 5, K = 40, α = 0.01. Θmax was varied between 1.0 and 1.75 in steps of 0.05. The initial values of the weights and the threshold, Θ, were chosen at random from the uniform distributions on the intervals [0, 1] and [0, Θmax], respectively. Training and testing each involved the presentation of 1000 different input patterns, repeated 100 times from the same initial weight values but with different random selections of input patterns. At the end of each test run, the mountains were renumbered according to the percentage of times the test events from the various mountains excited the output cell, the most active mountain being renumbered 1. The bar charts show the means and standard deviations of these percentage values for each mountain, calculated over the 100 runs. The three numbers above each plot give the value of Θmax and the mean and standard deviation of the actual value of Θ.
5.3 More Than One Mountain. The conditions under which the output cell can respond to more than one mountain are more complicated because there are edge effects to summate. It turns out that in the limit of small ε, the output cell responds stably to q particular mountains with Θmax bounded from above by

$$\Theta_{\max} < \frac{Q}{R}\left(1 - \frac{q - 1}{2(R - 1)}\right).$$

The larger the value of Θmax, the fewer mountains will excite the output cell. Note that for q = 1, this expression reduces to Θmax < Q/R. For the results shown in Figure 3, there were larger numbers of codon cells (N), mountains (Q), and mountains in which an individual codon cell is involved (R). The dependence of the number of mountains exciting the output cell on the value of Θmax can be seen clearly.

5.4 Learning Rule Dynamics. The analysis presented so far provides necessary conditions under which desirable stable states exist but not sufficient conditions that such states are reachable. To answer this question, an analysis of the dynamic behavior of the learning rule is needed. Equations 4.1 and 4.2 define a recursive stochastic estimation procedure for calculating the weights and threshold. Such systems are studied by control engineers, for whom questions of convergence are important. Using the results of Ljung (1975), it is possible to show that, provided that the Heaviside transfer function of the output cell is approximated by a steep sigmoid, the learning rule's dynamic behavior is equivalent to that of a certain continuous nonlinear dynamical system. The values of weights and threshold reachable by the learning rule are the limit points (if any) of that dynamical system, with probability 1. Further analysis of this equivalent dynamical system, involving demonstrating, for example, the applicability of the Cohen-Grossberg theorem (Cohen & Grossberg, 1983) to this case, is underway.

6 More Than One Output Cell: Competitive Learning

Marr only ever considered the case of a single output cell, although he did discuss in general terms the need to have several output cells that could learn to distinguish several mountains. This has been discussed by a number of authors under the banner of competitive learning (von der Malsburg, 1973; Grossberg, 1976; Rumelhart & Zipser, 1985).

6.1 Single-Winner Competitive Learning. In one simple case, discussed in particular by Rumelhart and Zipser (1985), each input pattern is to produce activity in a single output cell. In this way, different output cells learn to respond to the different types of input pattern. We considered the case where there are as many output cells as mountains, the intention being that
Figure 3: A single-output network trained on 10 mountains. N = 250, Q = 10, R = 5, K = 110. Θmax was varied from 0.6 to 2.1 in steps of 0.1. All other parameter values and labeling conventions are as in Figure 2.
each output cell would specialize onto the patterns from a different mountain. The activations {Aj : j = 1, 2, 3, . . . , N} in each of the Q output cells that result from presentation of an input pattern are calculated as Aj = (1/K)6i Wij Ci − 2j . The cell with the largest activation is then chosen to be active. In Figure 4, competitive learning is illustrated for 90 codon cells responding to patterns from Q = 5 mountains. This is the multiple output cell version of the single-cell system illustrated in Figure 2. The value of 2max has to be set appropriately because low values lead to some output cells responding to patterns from more than one mountain. The competitive learning scheme used by Rumelhart and Zipser (1985), among others, is very similar, with the following weight and threshold update rules, given here for a single output cell: 1Wi = α(−KWi + Wmax Ci )O 12 = α(−2 + 2max ).
(6.1) (6.2)
As before, α and 2max are adjustable parameters, and Wmax is set at 1. The two sets of competitive learning rules (equations 4.1, 4.2, 6.1, and 6.2) differ in having the roles of Ci and O interchanged. This scheme was applied to the problem used for Figure 4, using exactly the same set of parameters, as shown in Figure 5. 6.2 Distributed Outputs. Another possible use of a multiple output cell system is that all events from the same class of input pattern are to produce the same pattern of activity, distributed over the cells of the output layer. This
Figure 4: Facing page. Training a competitive network using the rule developed in this article. All parameter values as for Figure 2 except that there are Q = 5 output neurons and each network was trained for 2000 steps per run. To form each plot, after each run the percentage of times that patterns from a given mountain excited a given output cell was recorded in the appropriate square of a Q × Q, mountain × output cell, matrix. After the 100 different runs made for each value of 2max , the columns of each matrix were permuted to place the column containing the highest individual entry in position 1, the column with the next highest entry in position 2, and so on. Rows were then permuted to bring the row containing the highest entry to position 1, the next highest to position 2, and so on. Each plot is the summation of the results in the 100 matrices so formed. The color of each square in a plot indicates the mean activity on a gray scale, with pure white indicating 100 percent. Each horizontal bar drawn in the square is two standard deviations in length, being scaled by the width of each square that represents unity. The value of 2max used is shown above each plot.
926
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
means that patterns from different mountains may excite the same output cell. However, examination of the single runs used to generate Figures 2 and 3 revealed that the mountains that excite the same output cell lie adjacent to one another. This is not surprising since adjacent mountains share many codon cells, and therefore input patterns from adjacent mountains will share many weight values. This property of the network will prevent a welldistributed pattern of activity from being produced among the output cells. We carried out another set of simulations with a single output cell, in which codon cells were assigned to mountains at random rather than according to their position around the circle. The number of codon cells shared between events from mountains now varies randomly, about a mean value of M2 /N. We repeated the runs shown in Figures 2 and 3 and found that in this case too, the number of mountains exciting the output cell can be varied by adjusting the maximum threshold. When more than one mountain excites the output cell, the codon cells of these mountains are no longer correlated. When the system is given multiple output cells, each class of inputs will excite a different population of output cells, now randomly distributed. This procedure carries out more than just a simple recoding of input information since all the input patterns in a given class, each of which excites a different set of codon cells, are mapped onto the same output pattern. The theory and results for this case will be presented in a subsequent publication. 7 Competition and Competitive Learning There is a straightforward relation between the two competitive learning rules discussed that we shall formulate in terms of the fundamentals of competition. In a competitive learning algorithm, there has to be something for which the elements of the network compete (Prestige & Willshaw, 1975). Usually the competition is brought about by maintaining the total amount of synaptic weight per cell at a constant level. A basic distinction between competitive learning algorithms is that in some cases there is conservation of the sum of the weights of the synapses from each input cell; in other cases, the sum of the weights from each output cell is conserved. To clarify this distinction, we use the formulation of van der Malsburg and Willshaw (1981), developed in a neurobiological context. They separated out the activity-dependent change to be applied to the weights from that ensuring conservation of the summed weights. For input activity Ii and output activity Oj at synapse (i, j) with current weight value Wij , the raw weight change has the familiar Hebbian form (Hebb, 1949), δWij = αIi Oj , where α is a small positive constant. The weights are then normalized to keep their sum (calculated over some
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
Mountain number
1
1.05 5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
Mountain number
1.25
1.3
1.35
5
5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
1.4
Mountain number
1.15
5
1.2
1.45
1.5
1.55
5
5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
1.6
Mountain number
1.1
1.65
1.7
1.75
5
5
5
5
4
4
4
4
3
3
3
3
2
2
2
2
1
1
1
1
2 4 Output cell number
2 4 Output cell number
2 4 Output cell number
927
2 4 Output cell number
Figure 5: Training a competitive network using the learning rule due to Rumelhart and Zipser (1985). All parameter values as for Figure 4.
928
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
as-yet unspecified population of synapses) at a constant level. The new weight Wij∗ is then Wij∗ = (Wij + δWij )
6Wij . 6(Wij + δWij )
Substituting for δWij and defining 1Wij ≡ Wij∗ − Wij , for α ¿ 1, µ ¶ Wij 6Ii Oj . 1Wij = α Ii Oj − 6Wij
(7.1)
7.1 Postsynaptic Sum Rule. In equation 7.1, take the sum over the synapses to each output cell, and maintain this sum at unity. With K active input cells per pattern, 6i Ii Oj = KOj , and therefore 1Wij = α(−KWij + Ii )Oj . This rule was used by Grossberg (1976) in his competitive learning rule, INSTAR. Rumelhart and Zipser (1985) used it in a later paper, with 2max = 0; variable threshold schemes are also known in neurobiological applications (Bienenstock, Cooper, & Munro, 1982). 7.2 Presynaptic Sum Rule. Alternatively, take the sum over the synapses on each input cell, and maintain this sum at unity. Since only one output cell is ever active, with unit magnitude, 6j Ii Oj = Ii , and the presynaptic rule examined in this article is obtained, 1Wij = α(−Wij + Oj )Ii . This update rule is used in the OUTSTAR network due to Grossberg (1976), which can be thought of as INSTAR being run backward, the roles of inputs and outputs being interchanged. In an OUTSTAR network, only one unit is active per input pattern, so that the update rule becomes equivalent to the delta rule (Widrow & Hoff, 1960). Successful operation of the presynaptic rule in competitive learning requires appropriate setting of the parameter 2max . The problem is to make an inactive output unit respond before the values of its weights and its threshold have all shrunk to zero. Suppose that after training, one of the units fails to respond to any event. Then one of the other output units will respond to events from two mountains. As a result, each weight to this unit will have value 2/R and its threshold will be 22max /Q. In order to activate the inactive output unit, j, the sum (1/K)6i Wij Ci − 2j for it must exceed that for the active unit. This condition is 2/R − 2max (2/Q) < 0 2max > Q/R.
(7.2)
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
929
Therefore, provided that 2max exceeds a certain value, successful competitive learning will result. This value is that which in the single output cell case ensures that the cell is activated by the events from a single mountain only (see equation 5.3; compare Figures 2 and 4). 7.3 Comparison of Rules. The postsynaptic sum rule has been cited widely in the literature as the archetypal competitive learning rule. However, as Rumelhart and Zipser (1985), among others, have pointed out, it has a number of shortcomings. The form of this conservation rule ensures that the weights of an output unit are changed independently of the weights to all other output units. This has two important consequences. First, the weights of inactive units never get modified, and so these units do not take part in the competition at all. Rumelhart and Zipser (1985) described two modifications to the rule to deal with this problem—one called “leaky learning” and one involving a threshold placed on each output unit. Second, the algorithm may approach its final state only very slowly. If initially an output unit responds very infrequently, the changes in its weights will be accumulated gradually, and it may be some time before they are enough to make the unit a frequent winner. The result of these two effects is that initially one or two output cells come to respond to all the input patterns (see Figure 7; 250 steps). Dominance then shifts abruptly between cells (500, 750 steps) until a stable configuration results. The development of the mapping according to the presynaptic rule is much smoother (see Figure 6). Intuitively, the presynaptic rule seems more competitive, since at every learning step the weights of all output units are changed. For successful operation, the correct value of the parameter 2max has to be chosen. A different distinction between weight normalization rules has been drawn by Goodhill and Barrow (1994) and Miller and MacKay (1994), who distinguish between subtractive and divisive normalization. Both the presynaptic and postsynaptic rules discussed here are of the divisive type; in a subtractive rule, the value of the weight change 1Wij does not depend on Wij explicitly. The presynaptic-postsynaptic distinction can be used as a basis of comparison between the topology-preserving mapping (Kohonen, 1982) and elastic net (Durbin & Willshaw, 1987) approaches to combinatorial optimization. Kohonen’s method uses a weight update rule of the postsynaptic type; in the elastic net, the sum of all the attractive forces applied between each fixed (input) point and the mobile points on the elastic net is kept constant—a type of presynaptic rule. Both presynaptic and postsynaptic rules are used extensively in models for the development of connections in the nervous system. Both have natural interpretations in terms of the conservation of some type of resource available to the appropriate cells. Which rule is used depends on which is more suitable computationally and which fits better to the experimental results. For the development of the ordered retinotectal projection in
930
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
Mountain number
1; 0.76, 0.73
250; 0.79, 0.59
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
Mountain number
750; 1.32, 0.38
1000; 1.50, 0.28
1250; 0.49, 0.07
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1500; 0.41, 0.03
Mountain number
500; 1.08, 0.51
1750; 0.40, 0.05
2000; 0.40, 0.04
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1 2 3 4 5 Output cell number
1 2 3 4 5 Output cell number
1 2 3 4 5 Output cell number
Figure 6: Single run of a “presynaptic” competitive network. Snapshots of a single run of the type used to generate Figure 4. 2max = 2.0; all other parameter values as in Figure 4. The figures above each individual plot specify how many steps in the run had elapsed when the snapshot was taken and the mean and standard deviation of the threshold 2 (calculated over the five output units) at that point.
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
Mountain number
1; 0.76, 0.73
250; 1.59, 0.18 5
5
4
4
4
3
3
3
2
2
2
1
1
1
Mountain number
1000; 2.00, 0.00
1250; 2.00, 0.00
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1500; 2.00, 0.00
Mountain number
500; 1.88, 0.07
5
750; 1.98, 0.02
1750; 2.00, 0.00
2000; 2.00, 0.00
5
5
5
4
4
4
3
3
3
2
2
2
1
1
1
1 2 3 4 5 Output cell number
1 2 3 4 5 Output cell number
931
1 2 3 4 5 Output cell number
Figure 7: Single run of a “postsynaptic” competitive network. The run shown in Figure 6 was repeated using the algorithm of Rumelhart and Zipser (1985), with identical parameter values and sequences of patterns chosen during training and testing.
932
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
vertebrates, rules of both types have been used in separate models (Willshaw & von der Malsburg, 1976, 1979). The model due to Swindale (1980) on the development of ocular dominance stripes has a fixed number of receptors per postsynaptic cell, whereas in the models due to Goodhill and Willshaw (1990) and to Miller, Keller, and Stryker (1989), the total amount of presynaptic strength is conserved. For the elimination of synapses in the developing neuromuscular system, there are models incorporating either a presynaptic sum rule (Willshaw, 1981) or a postsynaptic rule (Gouz´e, Lasry, & Changeux, 1983). Recent theoretical work in this neurobiological application has emphasized the need, on biological grounds, for both rules to act simultaneously (Bennett & Robinson, 1989; Rasmussen & Willshaw, 1993). 8 Discussion Marr’s algorithm is a procedure that learns the discriminations among input patterns rather than one that learns to classify; it computes Probability(Output|Input) rather than Probability(Input|Output). Viewed as a pattern recognition algorithm, it is distinctive in that it works with unlabeled data (i.e., the class membership of each input pattern is unknown), and information about the absence of features in the presented pattern is not exploited. 8.1 Labeled versus Unlabeled Data. As an algorithm working on unlabeled data, it is of the unsupervised type. Because of its close relation to the “postsynaptic” algorithm due to Rumelhart and Zipser (1985), the statistical method that it most resembles is the K-means clustering algorithm (Moody & Darken, 1989). If the data were labeled, maximum likelihood schemes could be used, as discussed in section 3.1. Minsky and Papert (1969) showed that it is possible to construct a single-layer neural network that computes maximum likelihood. Given statistical independence among codon cell activities, it can be shown that computing maximum likelihood corresponds to carrying out the linear decision policy of the type that underlies many computations in neural networks. It turns out that the weight values are functions of the probability that the output is active when a codon cell is active as well as the probability that an output is active when a codon cell is inactive, which the presynaptic rule developed here does not exploit. The expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is applicable to both labeled and unlabeled data. It is worth noting that K-means can be regarded as a type of EM optimization of a Gaussian mixture model (Bishop, 1995). 8.2 Other Approaches. Marr concentrates on the problem of estimating the probability P(Output = 1|E) that the output cell should be active on presentation of a pattern of input activity representing the event E. Within this framework of estimation, where information about the states
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
933
of individual input cells is available, there are four pieces of information that could be used: P(Output = 1|Input = 1), P(Output = 1|Input = 0), P(Output = 0|Input = 1), and P(Output = 0|Input = 0). In Marr’s scheme, only the first of these pieces of information is recorded. One reason for doing this may have been that if each input cell is intended to represent a neuron, recording additional information would imply that a neuron has more than one information-bearing state. One can construct more complex neural models by, for example, duplicating input units so that one member of the pair would look after the active state of an input unit and the other member the inactive state. Willshaw (1972) did this in his Inductive Net, and there are interesting mathematical consequences. However, analysis of this radically different type of system would take us too far away from the focus of this article. Even if a scheme is adopted that uses information about these four quantities, the value of P(Output = 1|E) is still not fully determined. In the broader context, this problem could be solved by supposing that the information about the relative frequencies of occurrence of combinations of active input cells can be exploited (Minsky & Papert, 1969). Marr may have been thinking of this possibility because in his complete model, the codon cells are intended to act as hidden units that respond to the salient features of the input patterns rather than to the raw input data. However, his analysis was restricted to the simpler model without hidden units. 8.3 Structured Input Spaces: Mountains. Marr looked at one particular way of imposing structure within a set of input patterns, to define a set of “mountains.” He examined the special case of designing a network that can be trained to distinguish between the patterns from two mountains only, where his proposed scheme of using the equivalent of cerebellar climbing fibers to drive particular output cells does work. However, this is a special and somewhat trivial case, and our analysis generalizes to the cases of more than two mountains, where each codon cell is involved in more than two mountains, and more than one output cell. There are many ways of imposing structure on the input patterns. We have restricted ourselves to looking at Marr’s original prescription and the most straightforward generalizations of it. 9 Conclusions Our most important result is that a neurobiologically plausible implementation of the spatial recognizer effect due to David Marr has been found and can be made to function extremely reliably. The learning rule is plausible for two reasons: It is specified in terms of simple rules for the adjustment of weights on an incremental basis using only information that is locally available; and for its operation, it does not presuppose the existence of intricate circuitry, such as the equivalent of cerebellar climbing fibers, for which there
934
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
is no real evidence. Our analysis confirms our simulation results that the number of mountains that will fire the output cell can be set by adjusting the maximum value of the threshold. The more mountains that are required, the lower the maximum threshold must be. The competitive learning algorithm that we developed for the case of multiple output cells has strong relationships with existing competitive learning rules and is better known in computational neuroscience modeling circles than it is in the field of neural networks. In a forthcoming publication we shall examine more closely the relationship between presynaptic and postsynaptic competitive learning rules, and we shall also show how the learning algorithms that we have developed can be applied to classification tasks where other types of structure within ensembles of input patterns are considered. Two cases of interest are when the number of input units active per event, K, varies over the ensemble of events (mentioned briefly in section 6.2) and when the events from the different mountains can occur with different probabilities.
Acknowledgments We thank Peter Dayan, Martin Simmen, and And Turken for very helpful discussions and our referees for their critical comments. Financial support is acknowledged to the MRC (Programme Grant to D.W.) and to SERC (M.Sc. studentship for S.G.). Facilities were provided by the University of Edinburgh.
References Bennett, M. R., & Robinson, J. (1989). Growth and elimination of nerve terminals at synaptic sites during polyneuronal innervation of muscle cells: A trophic hypothesis. Proceedings of the Royal Society of London B, 235, 299–320. Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48. Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press. Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, 13, 815–826. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38. Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691.
Marr’s Theory of the Neocortex as a Self-Organizing Neural Network
935
Gingell, S. (1994). Extension of Lau’s 1992 simulation of Marr’s theory of the neocortex. Unpublished master’s thesis, University of Edinburgh. Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalization in competitive learning. Neural Computation, 6, 255–269. Goodhill, G. J., & Willshaw, D. J. (1990). Application of the elastic net algorithm to the formation of ocular dominance stripes. Network, 1, 41–59. Gouz´e, J. L., Lasry, J. M., & Changeux, J. P. (1983). Selective stabilization of muscle innervation during development: A mathematical model. Biological Cybernetics, 46, 207–215. Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134. Hebb, D. (1949). The organization of behavior. New York: Wiley. Kingman, J. F. C., & Taylor, S. J. (1966). Introduction to measure and probability. Cambridge: Cambridge University Press. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. Lau, S. L. (1992). Analysis of Marr’s model of the neocortex. Unpublished master’s thesis, University of Edinburgh. Ljung, L. (1975). Theorems for the asymptotic analysis of recursive stochastic algorithms (Tech. Rep. No. 7522). Lund, Sweden: Lund Institute of Technology. Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology (London), 202, 437–470. Marr, D. (1970). A theory for cerebral neocortex. Proceedings of the Royal Society of London B, 176, 161–234. Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London B, 262, 23–81. Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615. Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126. Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press. Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294. Prestige, M. C., & Willshaw, D. J. (1975). On a role for competition in the formation of patterned neural connexions. Proceedings of the Royal Society of London B, 190, 77–98. Rasmussen, C. E., & Willshaw, D. J. (1993). Presynaptic and postsynaptic competition in models for the development of neuromuscular connections. Biological Cybernetics, 68, 409–419. Rumelhart, D., & Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9, 75–112. Sibson, R. (1969). Information radius. Wahrscheinlichkeitstheorie, 14, 149–160. Swindale, N. V. (1980). A model for the formation of ocular dominance stripes. Proceedings of the Royal Society of London B, 208, 243–264. von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
936
David Willshaw, John Hallam, Sarah Gingell, and Soo Leng Lau
von der Malsburg, C., & Willshaw, D. J. (1981). Differential equations for the development of topological nerve fibre projections. SIAM-AMS Proceedings, 13, 39–47. Widrow, B., & Hoff, M. (1960). Adaptive switching circuits. In 1960 IRE WESCON Convention Record (Vol. 4, pp. 96–104). New York: IRE. Willshaw, D. J. (1972). A simple network capable of inductive generalisation. Proceedings of the Royal Society of London B, 182, 233–247. Willshaw, D. J. (1981). The establishment and the subsequent elimination of polyneural innervation of developing muscle: theoretical considerations. Proceedings of the Royal Society of London B, 212, 233–252. Willshaw, D. J., & Buckingham, J. T. (1990). An assessment of Marr’s theory of the hippocampus as a temporary memory store. Proceedings of the Royal Society of London B, 329, 205–215. Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London B, 194, 431–445. Willshaw, D. J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philosophical Transactions of the Royal Society of London B, 287, 203–243.
Received November 10, 1995; accepted July 11, 1996.
ARTICLE
Communicated by Terrence Sanger
Hybrid Learning of Mapping and Its Jacobian in Multilayer Neural Networks Jeong-Woo Lee Machine Control Laboratory, Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, Taejeon, Republic of Korea
Jun-Ho Oh Department of Mechanical Engineering, KAIST, Taejeon, Republic of Korea
There are some learning problems for which a priori information, such as the Jacobian of mapping, is available in addition to input-output examples. This kind of information can be beneficial in neural network learning if it can be embedded into the network. This article is concerned with the method for learning the mapping and available Jacobian simultaneously. The basic idea is to minimize the cost function, which is composed of the mapping error and the Jacobian error. Prior to developing the Jacobian learning rule, we develop an explicit and general method for computing the Jacobian of the neural network using the parameters computed in error backpropagation. Then the Jacobian learning rule is derived. The method is simpler and more general than tangent propagation (Simard, Victorri, Le Cun, & Denker, 1992). Through hybridization of the error backpropagation and the Jacobian learning, the hybrid learning algorithm is presented. The method shows good performance in accelerating the learning speed and improving generalization. And through computer experiments, it is shown that using the Jacobian synthesized from noise-corrupted data can accelerate learning speed. 1 Introduction Multilayer feedforward neural networks can be trained to learn a continuous function with desired accuracy (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989). Since the publication of Rumelhart, Hinton, & Williams (1986), many improved learning algorithms have been reported in the literatures, and various learning methods have been reviewed (Ergezinger & Thomsen, 1995; Jin, Nikiforuk, & Gupta, 1995; Baba, Mogami, Shiraishi, & Yamashita, 1995). The multilayer neural network, with various squashing functions, can approximate even first-order derivatives (Gallant & White, 1992; Hornik et al., 1990; Cardaliaguet & Euvrard, 1992). The method of learning first-order derivatives has not been clearly explored, however, and among the various types of learning methods, the error backpropagation Neural Computation 9, 937–958 (1997)
c 1997 Massachusetts Institute of Technology °
938
Jeong-Woo Lee and Jun-Ho Oh
learning rule (delta learning rule) and its variants are widely accepted for their simplicity and relatively good performance in training multilayer neural networks. Consider the following mapping: 9 : X 7→ Y
X ∈ RK and Y ∈ RJ .
(1.1)
Assume that we are to train the neural network with a sufficient number of mapping examples taken from 9. Suppose further that we have additional a priori information on the Jacobian of 9 even for a small number of the mapping examples: 8 : X 7→
∂Y ∂X
X ∈ RK and
∂Y ∂X
∈ RK×J .
(1.2)
In other words, suppose that we have some Jacobian at some data points, which is very important information on 9. To see the effect of Jacobian learning on neural network properties, consider single input–single output case. Suppose that sufficient learning data (xi , 9(xi )) and validation data (xi + δx, 9(xi + δx)) are provided (i = 1, 2, . . . , n). Then the learning error(JL ) and the validation error(JV ) can be expressed as: JL =
n 1X ˆ i ))2 (9(xi ) − 9(x n i=1
(1.3)
JV =
n 1X ˆ i + δx))2 (9(xi + δx) − 9(x n i=1
(1.4)
ˆ where 9(x) denotes neural network realization of 9(x). In this case, if the neural network learned the plant with best generalization, then it is natural that we expect JL ' JV . Therefore, although various measures are suggested and investigated (Niyogi & Girosi, 1996; Freeman & Saad, 1995), as candidates for generalization measures, we suggest using JG = n1 (JL − JV )2 , which can be expanded as n 1X ˆ i ))2 − (9(xi + δx) − 9(x ˆ i + δx))2 )2 ((9(xi ) − 9(x n i=1 !2 à n ˆ i )2 ¯¯ 1X ∂(9(x) − 9(x ' δx ¯ x=xi n i=1 ∂x à !2 n ¯ ˆ 4X ∂ 9(x) ∂9(x) ¯¯ ¯ 2 ˆ i )) = (9(xi ) − 9(x − δx2 . ¯ ¯ {z } n i=1 | ∂x x=xi ∂x x=xi term−A | {z }
JG =
term−B
(1.5)
Mapping and Its Jacobian
939
In this equation, term-A is related to the mapping error and term-B to the Jacobian error. If term-A is small but term-B remains large after learning, we cannot expect a good generalization because JG remains large. Although this analysis is rough, we can expect that better generalization can be obtained through additional Jacobian learning, which minimizes term-B in equation 1.5. Furthermore, because Jacobian represents the approximate inclination of mapping near a given point, we can expect that Jacobian learning can have the effect of learning several examples near the point, and thus we can expect fast learning speed through Jacobian learning. This article focuses on the following subjects: • A method of computing the Jacobian in a multilayer neural network. Computing the Jacobian in an efficient and explicit way is important because it is used to derive the Jacobian learning rule. • A method of learning the Jacobian in multilayer neural networks to use the available Jacobian for neural network learning. • An investigation of the effect of learning the Jacobian through computer experiments. In section 2, we derive the method of computing the Jacobian in terms of neural network parameters, which becomes very efficient if we use the error backpropagation learning rule. The derived explicit equation is used to derive the Jacobian learning rule. In section 3, we derive the Jacobian learning rule that minimizes the Jacobian error between the Jacobian of the neural network and the Jacobian provided. In section 4, we develop a hybrid learning method that minimizes the cost of the mapping error and the Jacobian error. In section 5, we present some computer experiments to demonstrate the effect of Jacobian learning. And finally, in section 6, we draw some conclusions and discuss some related topics. 2 Computing the Jacobian of a Multilayer Neural Network In this section, we derive the method for computing the Jacobian of a multilayer neural network while learning the mapping by error backpropagation. There has been some evidence that the Jacobian of a multilayer neural network can be calculated using the delta and other related quantities (Jordan & Rumelhart, 1992; White & Jordan, 1992; Iwata & Kitamura, 1994). We will summarize and arrange the method in a formal and explicit way. The equations for computing the Jacobian will be used for deriving the Jacobian learning rule. We begin by explaining the notations we use throughout the article. See Figure 1 for a simplified structure of the neural network. Let ˆ and Y denote the neural network input vector, the neural network X, Y, output vector, and the output vector of the underlying mapping from X,
940
Jeong-Woo Lee and Jun-Ho Oh
Figure 1: Simplified structure of the multilayer neural network. All layers with two nodes are assumed.
respectively. We will use the following notations: X ≡ [x1 , x2 , . . . , xK ]t ¤t £ Y ≡ y1 , y2 , . . . , yJ £ ¤t Yˆ ≡ yˆ 1 , yˆ 2 , . . . , yˆ J
(2.1)
yˆ i =
(2.4)
Op ≡ p oj p
=
fj ≡
ϕiL ( fiL h
+
θiL )
(2.2) (2.3)
≡ oLi i p t
p p o1 , o2 , . . . , ohp p p p ϕj ( fj + θj )
X
p−1
ok
(2.5) (2.6)
p
wjk
(2.7)
k
ˆ respectively. (L − where K and J denote the dimension of X and Y (or Y), 1), hp , ϕji , fji , θji , and oji denote the number of hidden layers, number of nodes in the pth layer, the sigmoidal function of node j in layer i, input to the sigmoidal function, and bias and output of node j in layer i, respectively. Superscript t means the transpose of a vector or a matrix. Let 2k and W k denote the bias input vector at the kth layer and the weight matrix in the kth layer, respectively. (The 0th layer denotes the input layer and the Lth layer denotes the output layer in the L layered neural network.) h it 2k ≡ θ1k , θ2k , . . . , θhkk Wk ≡
wk11 wk21 .. . wkhi 1
wk12 wk22 .. . ···
wk13 wk23 .. . ···
(2.8) ··· ··· .. . ···
wk1hi−1 wk2hi−1 .. . wkhi hi−1
(2.9)
Mapping and Its Jacobian
941
where wjik denotes the weight from neuron i in the (k − 1)th layer to neuron j in the kth layer. The weight in the first layer implicitly means the weights from the input to the first hidden layer. Define the input vector to the hth layer node as h i Fh ≡ f1h , f2h , . . . , fhhh = W h Oh−1 ,
(2.10)
and define the normalized error as e¯q ≡ colq (IJ×J )
(2.11)
where colq (A) denotes the qth column vector taken from the matrix A, and IJ×J denotes the J × J identity matrix. Define the normalized delta as "
0
ϕji e¯q 0 P ϕji m δ¯q(i+1)m wi+1 mj h it i i1 i2 ih ¯ q ≡ δ¯q δ¯q · · · δ¯q i 1
ij δ¯q
=
when layer i is output when layer i is hidden
(2.12) (2.13)
ij where, in δ¯q , i and j denote the layer number and the node number, respecij tively. The subscript q in δ¯q denotes δ i (defined in equation 4.5) calculated j
ij by the delta learning rule when the error is set to e¯q . The terms e¯q and δ¯q are introduced to take the multiple input, multiple output (MIMO) system into account. Now we can state the first theorem, which provides an easy way of computing Jacobian from neural network parameters.
Theorem 1 Computing the Jacobian of a Multilayer Neural Network. Let ˆ defined in equation 2.3, denote the input vector X, defined in equation 2.1, and Y, and the output vector of a multilayer neural network. Then the following relation holds, ∂ Yˆ = ∂X
¯ 1 )t W 1 (1 1 ¯ (112 )t W 1 .. . ¯ 1 )t W 1 (1 J
(2.14)
where the notations are given above. (See the appendix of this article for the proof.) This theorem is known and used by some authors (Jordan & Rumelhart, 1992; White & Jordan, 1992; Iwata & Kitamura, 1994). Although they address the single-output case, their method can easily be extended to the MIMO
942
Jeong-Woo Lee and Jun-Ho Oh
case. We summarize it to express the formula in an explicit way, because an explicit formula is indispensable for deriving the Jacobian learning rule. It is clear from equations 2.12 and 2.14 that the Jacobian matrix should be computed row by row by setting the corresponding normalized error, which involves setting corresponding output to 1 at a time. If we compute the Jacobian under the online computation of the error backpropagation 0 learning rule, ϕji and ϕji are computed to adjust the weights. Therefore, it is not so burdensome to compute the Jacobian under the online learning by error backpropagation. If the mapping to be learned is the multipleˆ which can be input, single-output (MISO), then δ¯ is identical with δ/y − y, computed at the final stage of error learning. In this case, the theorem can be stated in a simpler way: ∂ yˆ = (11 )t W 1 . ∂X
(2.15)
(In this case, since the output is one, the subscript in 1 is omitted.) The computation of the Jacobian can be simplified to one multiplication of the ¯ 1. weight matrix W 1 and the vector 1 3 Derivation of the Jacobian Learning Rule Simard et al. (1992) showed that for MISO cases, gradient learning combined with error learning can improve learning speed. Their method requires computing tangent propagation. The method cannot handle missing data in Jacobian, because it requires gradient forward propagation and more computation time. In this section, we derive an efficient Jacobian learning rule that is applicable to MIMO cases and does not require gradient forward propagation in MISO cases. The learning rule can be derived by minimizing the following cost function: Jg ≡
1X (gkl − gˆ kl )2 2 kl
where
gkl =
∂yl ∂xk
gˆ kl =
∂ yˆ l . (3.1) ∂xk
To express the Jacobian learning rule in a compact recursive form, define the following quantities: 1 when i = 1 ipq w1lp when i = 2 (3.2) µkl = P (i−1)pq i−1 (i−2)0 wlr ϕr when i ≥ 3 r µlr X ik (i+1)m i+1 ¯ dq = δq wmk (3.3) m
In equations 3.2 and 3.3, the subscripts p and q denote that we are working to learn ∂yq /∂xp .
Mapping and Its Jacobian
943
To derive the Jacobian learning rule, we use the steepest descent method, which is also used for backpropagation: 1 ∂(gpq − gˆpq )2 ηg = − 1g wm kl 2 ∂wm kl = ηg
∂ gˆpq (gpq − gˆpq ) . ∂wm kl
(3.4)
m We need only to compute ∂ gˆpq /∂wm kl to derive 1 g wkl . For the case of m = 1 (first layer),
∂ gˆpq ∂w1kp
P
=
∂(
¯1i 1 i δq wip ) ∂w1kp
= δ¯q1k +
∂ δ¯q1k ∂w1kp
w1kp
00 = δ¯q1k + xp ϕk1 w1kp d1k q .
(3.5)
For the case of m = 2, similarly, ∂ gˆpq ∂w2kl
P
= =
¯1i 1 i δq wip ) ∂w2kl P 1 10 P 2r 2 ¯ ∂( i wip ϕi r δq wri ) ∂w2kl ∂(
0 = w1lp ϕl1 δ¯q2k + 0 = w1lp ϕl1 δ¯q2k +
X
0
ϕi1 w1ip
i 1 200 2k ol ϕk dq
∂ δ¯q2k
w2 ∂w2kl ki X 0 ϕi1 w1ip w2ki .
(3.6)
X 0 2 2 (ϕm wmi w3km ).
(3.7)
i
Similarly, for the case of m = 3, ∂ gˆpq ∂w3kl
0 = δ¯q3k ϕl2
X i 00
0
ϕi1 w1ip w2li
+o2l ϕk3 d3k q
X i
0
ϕi1 w1ip
m
We can obtain the weight adaptation rule for the Jacobian learning by substituting equations 3.5 through 3.7 into equation 3.4. For the biases, a similar adaptation rule for Jacobian learning can be obtained. In fact, the Jacobian
944
Jeong-Woo Lee and Jun-Ho Oh
leaning rule, including m = 1, 2, and 3, can be generalized as follows: m−1 m00 mk (m+1)pq ¯mk m0 mpq 1g wm ϕk dq µkl )(gpq − gˆpq ). kl = η g (δq ϕl µkl + ol | {z } | {z }
1g θkm
=
term1 m00 mk (m+1)pq ηg ϕk dq µkl (gpq
(3.8)
term2
− gˆpq ).
(3.9)
In equation 3.8, o0l means the inputs of neural network, xl . The derivation of the Jacobian learning rule for the general case in equations 3.8 and 3.9 can be found in the appendix. In equation 3.8, term1 represents the effect of the weight variation to the Jacobian of the neural network in forward direction. And term2 represents the effect of variation of δ¯ in the mth layer neuron. Because δ¯ affects the Jacobian of the neural network, term2 cannot be neglected. As can be seen in the above Jacobian learning rule, the ith layer of µ in equation 3.2 is computed using their value at the (i−1)th layer, so the direction of Jacobian learning is forward. The terms needed for computing the Jacobian, that is, δ¯ in equations 4.5 and 2.13, are computed in a backward direction. This is the reverse direction to that of error learning. 4 Hybrid Learning of Mapping and Its Jacobian If we set the cost function as follows, we can derive the weight adaptation rule, which minimizes the mapping error and the Jacobian error: J ≡ λe Je + λg Jg 1X Je ≡ (yk − yˆ k )2 . 2 k
where
(4.1) (4.2)
In equation 4.1, λe and λg denote the weighting coefficients for the mapping error and the Jacobian error, which may vary with iteration of learning. Weight correction, 1wm kl , to the direction that minimizes J is assumed to be decomposed as m m 1wm kl = λe 1e wkl + λ g 1 g wkl ,
(4.3)
where 1e wm kl denotes the weight correction to the direction that minimizes Je , and 1g wm kl denotes the weight correction to the direction that minimizes is calculated according to error backpropagation (Rumelhart et al., Jg . 1e wm kl 1986), 1e wm kl = −ηe
∂ Je ∂wm kl
= ηe δkm om−1 , l
(4.4)
Mapping and Its Jacobian
945
Figure 2: Direction of the weight correction. The two-dimensional weights are assumed.
where δk is defined as · δkm =
0
ek ϕkL 0 P ϕkm l δlm+1 wm+1 lk
when neuron k is output, ek = yk − yˆ k (4.5) when neuron k is hidden.
The following steps are required for the hybrid learning of the mapping and its Jacobian: 1. Compute neural network outputs(yˆj ) and mapping errors in the forward direction. 2. Compute δ¯m and 1e w in the backward direction. k
3. Compute Jacobian errors (Eg ) at the end of step 2. 4. Correct weights according to equation 4.3 while computing δ¯ijm and 1g w in the forward direction. The weight corrections λe 1e W and λg 1g W can conflict with each other. As shown in Figure 2, if the angle between λe 1e W and λg 1g W is relatively large and their magnitudes are approximately same, then the resultant weight correction, 1W, can be the direction that can minimize neither Je nor Jg . In this case, tuning the weighting coefficients λe and λg is recommended. If one is to weight the error, then it is better to increase λe relative to λg . But in many cases, the weighting coefficients can be set to constant because, after some iteration of learning, the two directions of the weight correction can approximately coincide with each other. If λe is very small compared with λg , then almost all learning is done for Jacobian learning. In this case, the learned neural network will have a function shape very similar to the underlying mapping, but the mapping error may be large. Clearly the tuning of the two weighting coefficients is important.
946
Jeong-Woo Lee and Jun-Ho Oh
5 Computer Experiments 5.1 Experiment 1: Comparison of Learning Speed. In this experiment, we choose static mapping with one relatively sharp corner as an underlying mapping. The mapping to be learned by the neural network is generated by µ
(x − 0.05)2 y = −13(x − 0.5) + 8 exp − 0.01 2
¶ .
(5.1)
The shape of the mapping to be learned is shown in Figure 3. The neural networks are constructed as follows: • One hidden layer with 60 neurons; bias inputs to hidden and output nodes. • Squashing function: Hyperbolic tangent for hidden layers, linear output node. • Error learning rate, ηe = 0.01; Jacobian learning rate, ηg = 0.001. • Number of iterations for learning: 2000 epochs. For each epoch, 80 samples are presented. • Initial weights: Random number between −1 and 1. • Learning examples: Chosen randomly within the range of −0.3 ≤ x ≤ 1.3. • Weighting coefficients(λe and λg ): Set to unity. No additional techniques are used to accelerate the learning speed. To compare the differences in the learning properties easily, we set the topology of the neural network and the learning rate identical for error backpropagation and hybrid learning. The only difference is whether Jacobian learning is added. The neural network mapping error(Je ) is shown in Figure 4. The top figure is the result of conventional error backpropagation, and the bottom figure is that of hybrid learning. The middle figure is the sum of the squared error(Je ) by assuming that the Jacobian is available on 50 percent of learning data. Careful investigation of the results leads to the following conclusions: • Even we take the computation time of hybrid learning (almost twice that of error backpropagation in this experiment) into consideration, learning is far more accelerated with the additional Jacobian learning. • The result of the hybrid learning improves as more Jacobian information is available. 5.2 Experiment 2: Use of the Jacobian Synthesized from Noise-Corrupted Data. In this experiment, we investigate the possibility of using the
Mapping and Its Jacobian
947
6
4
2
y
0
−2
−4
−6
−8
−10 −0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
x
Figure 3: Shape of the underlying mapping used in experiments 1 and 2.
Jacobian synthesized from the noise-corrupted data, a similar experiment with a real-world problem. In real data sets, the Jacobian may not be provided, and even the data set may be corrupted by measurement noise. Using the data set in experiment 1, 80 noise-corrupted samples are generated by the following equation: ¶ µ xi − 0.05)2 2 yi = −13(xi − 0.5) + 8 exp − 0.01 + 0.5 × randn i = 1, 2, . . . , 80 xi = −0.3 + 0.2i
i = 1, 2, . . . , 80
where randn denotes normal gaussian noise. The underlying mapping and noise-corrupted data are shown in Figure 5. To extract the rough Jacobian, we use the following simple formula: µ ¶ 1 yi − yi−1 yi+1 − yi = + . (5.2) gi 2 xi − xi−1 xi+1 − xi Because the Jacobian obtained in this way is very noisy, we use the following simple averaging filter: gi | f iltered =
gi−1 + 2gi + gi+1 . 4
(5.3)
Figure 6 shows the added noise (top), the synthesized Jacobian with equation 5.2 (middle panel,) and the filtered Jacobian (bottom panel). Because the
948
Jeong-Woo Lee and Jun-Ho Oh
Network Error(0%)
4
10
3
10
2
Sum−Squared Error
10
1
10
0
10
−1
10
−2
10
−3
10
0
200
400
600
800
1000 Epoch
1200
1400
1600
1800
2000
1400
1600
1800
2000
1400
1600
1800
2000
Network Error(50%)
4
10
3
10
2
Sum−Squared Error
10
1
10
0
10
−1
10
−2
10
−3
10
0
200
400
600
800
1000 Epoch
1200
Network Error(100%)
4
10
3
10
2
Sum−Squared Error
10
1
10
0
10
−1
10
−2
10
−3
10
0
200
400
600
800
1000 Epoch
1200
Figure 4: Neural network mapping error (Je ) during 2000 epochs of learning in experiment 1. Top: µe = 0.01, µg = 0. No Jacobian learning. Middle: µe = 0.01, µg = 0.001. Jacobian learning for 50 percent of the examples. Bottom: µe = 0.01, µg = 0.001. Jacobian learning for all examples.
Mapping and Its Jacobian
949
6
4
2
0
−2
−4
−6
−8
−10 −0.4
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
Figure 5: Plot of the underlying mapping (smooth curve) and the noisecorrupted data set in experiment 2.
synthesized Jacobian is not the exact Jacobian of the underlying mapping, learning of filtered Jacobian at the fine-tuning stage is undesirable and may lead to an inferior model, so we tune the learning rate as follows:
0.001 decreased linearly ηg = from 0.001 to 0. 0
0.01 ηe = decreased linearly from 0.001 to 0.
0 < epoch ≤ 500 500 < epoch ≤ 1000 1000 < epoch ≤ 2000
(5.4)
0 < epoch ≤ 1000 1000 < epoch ≤ 2000.
(5.5)
All the other conditions except the learning rate are the same as in experiment 1. The neural network mapping error (Je ) is shown in Figure 7. Although we believe that hybrid learning requires two passes of weight updates, the result shows that the learning is much faster, by three to four times, compared with that of error backpropagation. This experiment indicates that the use of the filtered Jacobian synthesized from the raw data can be effective in accelerating learning; however, the structure of the filter should vary according to the characteristics of the given data set. 5.3 Experiment 3: Suppression of Overfitting. In this experiment, we investigate the ability to prevent overfitting by additional Jacobian learning.
950
Jeong-Woo Lee and Jun-Ho Oh (a) Data Noise 2 1 0 −1 −0.5
0 0.5 1 (b) Extracted Jacobian and Jacobian of underlying mapping
1.5
0 0.5 1 (c) Filtered extracted Jacobian and Jacobian of underlying mapping
1.5
100
0
−100 −0.5 100
0
−100 −0.5
0
0.5
1
1.5
Figure 6: Experiment 2. (a) Noise added to the underlying mapping. (b) Jacobian synthesized from the noise-corrupted data by using equation 5.2. The smooth curve in the figure represents the Jacobian of underlying mapping (see equation 5.1). (c) Filtered Jacobian shown in (b) with simple averaging filter by using equation 5.3. The smooth curves in the figure represent the Jacobian of the underlying mapping (see equation 5.1).
For this purpose, the experiment is designed to fit a straight line with a small number of examples. Experimental conditions were as follows: • For the neural network, there was one hidden layer with 100 neurons of hyperbolic tangent with bias and one linear output node with bias. The error learning rate was 0.0005, and the Jacobian learning rate was 0.0005. There were 20,000 iterations for learning. • Seven sample data and seven validation data, which were noise corrupted, were generated by the following equations: xi = 0.4i + rand
i = 1, 2, . . . , 7
yi = 0.5xi + rand
i = 1, 2, . . . , 7
where rand denotes the random number between −1 and 1. • Learning examples were shown in a random sequence.
Mapping and Its Jacobian
951 Network Error
4
10
Sum−Squared Error
3
10
2
10
1
10
0
100
200
300
400
500 Epoch
600
700
800
900
1000
700
800
900
1000
Network Error
4
10
Sum−Squared Error
3
10
2
10
1
10
0
100
200
300
400
500 Epoch
600
Figure 7: Neural network mapping error (Je ) in experiment 2. Top: Result of conventional error backpropagation. Bottom: Result of the proposed method by using the filtered Jacobian (see equation 5.3) synthesized from the noisecorrupted data.
• For each learning of the sample points in Jacobian learning, the noisecorrupted Jacobian was provided by the following equation: g = 0.5 + randn × 0.3 where randn denotes normal gaussian noise. • The weighting coefficients were set to unity.
(5.6)
952
Jeong-Woo Lee and Jun-Ho Oh
No other modifications were attempted so that we could see the effects of the Jacobian learning more clearly. As in the previous experiment, the same topology of the neural network was used, and the only difference was whether the Jacobian learning was added. The learning error(JL ) and validation error (JV ) versus learning epoch are shown in Figures 8 and 9. The result of neural network learning and the average of the squared distance between the neural network and underlying mapping (y = 0.5x) is also shown. The learning error (JL ) and validation error (JV ) are computed by summing the square of error at the points. The averages of JL and JV in the last 100 iterations for error learning are 0.51 and 4.41, respectively. Those for hybrid learning are 1.88 and 3.38, respectively. The difference between JL and JV in hybrid learning (1.50) is smaller than that in error learning (3.90). A comparison of the two sets of figures shows that the mean distance between the neural network and underlying mapping in error backpropagation is larger than that in hybrid learning. Although the learning error for learning samples in hybrid learning is larger than that in error backpropagation, there is better generalization. These observations support the hypothesis set out in section 1 regarding the effect of Jacobian learning on generalization. 6 Conclusions We have proposed an algorithm that learns the mapping and its Jacobian. For this purpose, we derived the explicit form of computing the Jacobian using the neural network parameters, and we derive the hybrid learning algorithm that minimizes the cost function consisting of mapping error and Jacobian error. The proposed Jacobian learning method is similar to that of the error backpropagation learning rule but in the opposite direction of calculation. Computer experiments proved that the proposed method is faster than conventional error backpropagation, because the proposed hybrid learning algorithm learns to minimize both the mapping error and the Jacobian error simultaneously. We show as well the possibility that using the rough Jacobian synthesized from noise-corrupted raw data set can accelerate learning speed and that Jacobian learning improves generalization through computer experiment. Appendix A.1 Proof of Theorem 1. Proof. From the definition of the feedforward neural network (at output layer), yˆj = ϕjL ( fjL + θjL )
Mapping and Its Jacobian
953
3
1
10
2.5
Mean Squared Distance
2
1.5
y
1
0.5
0
10
−1
10
0
-0.5
-1 -2
−2
-1
0
1
2
3
4
5
6
10
7
0
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
x
2
2
10
10
1
1
10
Learning Error
Validation Error
10
0
0
10
10
−1
10
0
−1
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
10
0
Figure 8: Experiment 3: Error backpropagation learning. Upper left: Learned result of neural network. Dotted line: underlying mapping, o: validation data, *: sample learning data. Upper right: Mean squared distance between neural network and underlying mapping (−0.5 < x < 6.5). Lower left: Sum of squared error for learning data. Lower right: Sum of squared error for validation data.
à =
X
ϕjL
X
= ϕjL
L L−1 wjk ϕk
k
à =
+
θjL
k
Ã
ϕjL
! L L−1 wjk ok
X
³
fkL−1
+
θkL−1
´
! + θjL
ÃÃ L L−1 wjk ϕk
k
X i
! L−2 wL−1 ki oi
! +
θkL−1
! +
θjL
.
to get We differentiate the outputs directly by oL−2 i ∂ yˆj ∂oL−2 i
= ϕjL
0
X L L−10 L−1 (wjk ϕk wki ) ≡ gˆ L−2 ij . k
(A.1)
954
Jeong-Woo Lee and Jun-Ho Oh
3
1
10
2.5
Mean Squared Distance
2
1.5
y
1
0.5
0
10
−1
10
0
-0.5
-1 -2
−2
-1
0
1
2
3
4
5
6
10
7
0
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
x
2
2
10
10
1
1
10
Learning Error
Validation Error
10
0
0
10
10
−1
10
0
−1
100
200
300
400 500 600 Learning Epoch
700
800
900
1000
10
0
Figure 9: Experiment 3: Hybrid learning. Upper left: Learned result of neural network. Dotted line: underlying mapping; o: validation data; *: sample learning data. Upper right: Mean squared distance between neural network and underlying mapping (−0.5 < x < 6.5). Lower left: Sum of squared error for learning data. Lower right: Sum of squared error for validation data.
At the output layer, normalized delta is expressed as h it L0 ¯ Lm ≡ 0, . . . , 0, ϕm , 0, . . . , 0 , 1
(A.2)
0
L is the first-order derivative of the activation function of the mth in which ϕm element and the superscript L denotes the output layers. From equation 2.12,
(L−1)k = ϕk(L−1) δ¯m
0
X
Lj L δ¯m wjk
j 0
L L = ϕk(L−1) ϕm wmk 0
(A.3)
Mapping and Its Jacobian
955
and h i (L−1)hL−1 t (L−1)1 ¯ (L−1)2 ¯ L−1 ≡ δ¯m , δm , . . . , δ¯m 1 m h it 0 0 0 L0 L L0 L L0 L = ϕ1(L−1) ϕm wm1 , ϕ2(L−1) ϕm wm2 , . . . , ϕh(L−1) ϕ w , m mh L−1 L−1
(A.4)
where hL−1 is the number of nodes in the (L − 1)th hidden layer. If we apply the theorem, assuming (L − 1) to be the only hidden layer and (L − 2) the input layer, t ¯ L−1 ((W L−1 )t 1 m ) L−1 w11 ··· wL−1 1hL−2 wL−1 · · · · ·· 21 t ¯ L−1 = (1 .. .. .. m ) . . . L−1 · · · w wL−1 h(L−1) 1 hL−1 hL−2 h i ˆ L−2 ˆ L−2 = gˆ L−2 ≡ Gˆ L−2 m 1m , g 2m , . . . , g hL−2 m .
(A.5)
Equation A.5 denotes the gradient of yˆ m with respect to OL−2 . If we assume that the only hidden layer is (L − 1) and layer (L − 2) is the input layer, equation A.5 is equivalent to the Jacobian of the output vector with respect in equation A.5 using equation A.4, to the input vector. If we rewrite gˆ L−2 ij à gˆ L−2 ij
=
0 ϕjL
X
! L L−10 wL−1 ki wjk ϕk
.
(A.6)
k
Equations A.1 and A.6 are identical for all possible choice of j. This proves the theorem when only one hidden layer exists in the neural network. Next, we prove that if ∂ yˆj ∂Ol
¯ l+1 )t W l+1 = (1 j
(A.7)
is true, then ∂ yˆj ∂Ol−1
¯ l )t W l = (1 j
(A.8)
also holds. This means that if the theorem holds for (L − l − 1) hidden layers, from it holds for (L−l) hidden layers. If we extract one component ∂ yˆj /∂ol−1 i equation A.8, ∂ yˆj ∂ol−1 i
=
X k
δ¯jlk wlki
956
Jeong-Woo Lee and Jun-Ho Oh
=
X
à 0 ϕkl wlki
X m
k
! δ¯j(l+1)m wl+1 . mk
(A.9)
And if we apply the chain rule and use equations 2.5 through 2.7, ∂ yˆj ∂ol−1 i
=
∂ yˆj ∂Ol ∂Ol ∂ol−1 i
(using equation A.7)
¯ l+1 )t W l+1 = (1 j
∂Ol ∂ol−1 i
.
(A.10)
Furthermore, the following two equations hold: P
l+1 ¯ (l+1)m m wm1 δj P l+1 ¯ (l+1)m m wm2 δj
= P
¯ l+1 )t W l+1 (1 j
.. .
l+1 ¯ (l+1)m m wmhl δj
.
h 0 it 0 0 = ϕ1l wl1i , ϕ2l wl2i , . . . , ϕhl l wlhl i .
∂Ol ∂ol−1 i
(A.11)
(A.12)
If we substitute equations A.11 and A.12 into equation A.10, then we can easily see the result is identical with equation (A.9). Thus, by induction, the theorem is proved for any number of hidden layers. A.2 Derivation of Jacobian Learning Rule for General Case. In the following derivation, we use the notation that, in subscript im , m is a variable. For example, if m = 3, subscript im−1 denotes subscript i2 . For any m, we can rewrite gˆpq as X δ¯q1i1 w1i1 p gˆpq = i1
A.2 Derivation of Jacobian Learning Rule for General Case. In the following derivation, we use the notation that, in subscript $i_m$, $m$ is a variable. For example, if $m = 3$, subscript $i_{m-1}$ denotes subscript $i_2$. For any $m$, we can rewrite $\hat{g}_{pq}$ as

$$\hat{g}_{pq} = \sum_{i_1} \bar{\delta}_q^{1 i_1} w_{i_1 p}^1 = \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \bar{\delta}_q^{2 i_2} w_{i_2 i_1}^2 = \cdots = \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} \sum_{i_m} \bar{\delta}_q^{m i_m} w_{i_m i_{m-1}}^m. \tag{A.13}$$
And from equation 2.12,

$$\frac{\partial \bar{\delta}_q^{ij}}{\partial w_{jl}^i} = o_l^{i-1} \varphi_j^{i''} \sum_t \bar{\delta}_q^{(i+1)t} w_{tj}^{i+1}. \tag{A.14}$$
Then we can write:

$$\begin{aligned}
\frac{\partial \hat{g}_{pq}}{\partial w_{kl}^m} &= \bar{\delta}_q^{mk} \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-2}} \varphi_{i_{m-2}}^{(m-2)'} w_{i_{m-2} i_{m-3}}^{m-2} \, \varphi_l^{(m-1)'} w_{l i_{m-2}}^{m-1} \\
&\quad + \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} w_{k i_{m-1}}^m \, \frac{\partial \bar{\delta}_q^{mk}}{\partial w_{kl}^m} \\
&= \bar{\delta}_q^{mk} \varphi_l^{(m-1)'} \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-2}} \varphi_{i_{m-2}}^{(m-2)'} w_{i_{m-2} i_{m-3}}^{m-2} w_{l i_{m-2}}^{m-1} \\
&\quad + o_l^{m-1} \varphi_k^{m''} \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} w_{k i_{m-1}}^m \sum_t \bar{\delta}_q^{(m+1)t} w_{tk}^{m+1} \\
&= \bar{\delta}_q^{mk} \varphi_l^{(m-1)'} \mu_{kl}^{m p q} + o_l^{m-1} \varphi_k^{m''} d_q^{mk} \mu_{kl}^{(m+1) p q}.
\end{aligned} \tag{A.15}$$
If we substitute equation A.15 into equation 3.4, we can obtain equation 3.8. Similar derivations can be made for equation 3.9. From equation 2.12,

$$\frac{\partial \bar{\delta}_q^{ij}}{\partial \theta_j^i} = \varphi_j^{i''} \sum_t \bar{\delta}_q^{(i+1)t} w_{tj}^{i+1}. \tag{A.16}$$
Using equations A.13 and A.16, we can write:

$$\begin{aligned}
\frac{\partial \hat{g}_{pq}}{\partial \theta_k^m} &= \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} \, \frac{\partial \left( \sum_{i_m} \bar{\delta}_q^{m i_m} w_{i_m i_{m-1}}^m \right)}{\partial \theta_k^m} \\
&= \sum_{i_1} \varphi_{i_1}^{1'} w_{i_1 p}^1 \sum_{i_2} \varphi_{i_2}^{2'} w_{i_2 i_1}^2 \cdots \sum_{i_{m-1}} \varphi_{i_{m-1}}^{(m-1)'} w_{i_{m-1} i_{m-2}}^{m-1} w_{k i_{m-1}}^m \, \varphi_k^{m''} \sum_t \bar{\delta}_q^{(m+1)t} w_{tk}^{m+1} \\
&= \varphi_k^{m''} d_q^{mk} \mu_{kl}^{(m+1) p q}.
\end{aligned} \tag{A.17}$$
If we substitute equation A.17 into equation 3.4, we can obtain equation 3.9.
Received February 6, 1996; accepted September 10, 1996.
NOTE
Communicated by Ken Miller
The Joint Development of Orientation and Ocular Dominance: Role of Constraints Christian Piepenbrock Department of Computer Science, Technical University of Berlin, Berlin, Germany
Helge Ritter Faculty of Technology, University of Bielefeld, Bielefeld, Germany
Klaus Obermayer Department of Computer Science, Technical University of Berlin, Berlin, Germany
Correlation-based learning (CBL) has been suggested as the mechanism that underlies the development of simple-cell receptive fields in the primary visual cortex of cats, including orientation preference (OR) and ocular dominance (OD) (Linsker, 1986; Miller, Keller, & Stryker, 1989). CBL has been applied successfully to the development of OR and OD individually (Miller, Keller, & Stryker, 1989; Miller, 1994; Miyashita & Tanaka, 1991; Erwin, Obermayer, & Schulten, 1995), but the conditions for their joint development have not been studied (but see Erwin & Miller, 1995, for independent work on the same question), in contrast to competitive Hebbian models (Obermayer, Blasdel, & Schulten, 1992). In this article, we provide insight into why this has been the case: OR and OD decouple in symmetric CBL models, and a joint development of OR and OD is possible only in a parameter regime that depends on nonlinear mechanisms.

1 The Correlation-Based Learning Model

Following the CBL approach (Linsker, 1986; Miller, 1990), we consider four populations $i = 1, 2, 3, 4$ of geniculate "input" cells, which we call left-eye-ON, left-eye-OFF, right-eye-ON, and right-eye-OFF, respectively, and one type of cortical cell receiving the afferents. Cells are modeled as connectionist-type neurons and are arranged in two-dimensional layers, where $\vec{\alpha}, \vec{\beta}$ denote geniculate locations and $\vec{x}, \vec{y}$ specify cortical positions (in the same coordinate system). The geniculo-cortical projection is described by synaptic strengths $S^i_{\vec{\alpha},\vec{x}}$ for the connections between geniculate cells $(i, \vec{\alpha})$ and cortical cells $\vec{x}$. The activity-driven development of $S^i_{\vec{\alpha},\vec{x}}$ is generally described by a Hebbian model (Miller, 1990),

$$\frac{d}{dt} S^i_{\vec{\alpha},\vec{x}}(t) = A_{\vec{\alpha},\vec{x}} \sum_{j,\vec{\beta},\vec{y}} I_{\vec{x},\vec{y}} \, C^{ij}_{\vec{\alpha},\vec{\beta}} \, S^j_{\vec{\beta},\vec{y}}(t) - \epsilon A_{\vec{\alpha},\vec{x}} - \gamma S^i_{\vec{\alpha},\vec{x}}(t), \tag{1.1}$$
where:

• $A_{\vec{\alpha},\vec{x}}$ is the localized arbor function, which ensures topography and is established by an intrinsic process before activity-driven processes set in. $A_{\vec{\alpha},\vec{x}}$ represents the synaptic density between a geniculate neuron at $\vec{\alpha}$ (independent of layer $i$) and a cortical neuron at $\vec{x}$.

• $I_{\vec{x},\vec{y}}$ denotes the effective intracortical interactions between neurons at locations $\vec{x}$ and $\vec{y}$. This function couples the receptive field development of cortical neurons and thus induces correlations between the receptive field properties of neighboring cells (i.e., cortical map development).

• $C^{ij}_{\vec{\alpha},\vec{\beta}}$ are the two-point correlation functions of geniculate activities at $(i, \vec{\alpha})$ and $(j, \vec{\beta})$. These functions determine the structural properties of the developing receptive fields—whether ON/OFF subfields or eye dominance emerge.

• The last two terms are necessary to constrain the growth of synaptic weights in Hebbian models. Weights may be normalized by multiplicatively scaling them or by subtracting a constant. Multiplicative (M) constraints are implemented by a nonzero time-dependent $\gamma$; subtractive (S) constraints require a nonzero time-dependent $\epsilon$ (Miller & MacKay, 1994).

2 Analysis of the Model

We make two biologically motivated assumptions to characterize analytically the conditions under which OR and OD develop in this model framework. The first generally used assumption is symmetry of the activity correlations of ON and OFF cells and of the left and right eye. Symmetry leaves only four independent components of the correlation matrix $\{C^{ij}\}$:

$$\begin{aligned}
C^1 &\equiv C^{11} = C^{22} = C^{33} = C^{44} && \text{same eye, same cell type} \\
C^2 &\equiv C^{12} = C^{21} = C^{34} = C^{43} && \text{same eye, different cell type} \\
C^3 &\equiv C^{13} = C^{24} = C^{31} = C^{42} && \text{opposite eye, same cell type} \\
C^4 &\equiv C^{14} = C^{23} = C^{32} = C^{41} && \text{opposite eye, different cell type.}
\end{aligned} \tag{2.1}$$

Under this assumption, equation 1.1 decouples into four independent differential equations for the development of the modes shown in Table 1. We conclude that if development proceeds purely linearly without bound, the ratio of OR and OD becomes arbitrarily large or small. Thus, either ON/OFF segregation or OD segregation ultimately dominates. Erwin and Miller (1995) independently obtained this linear system decomposition.
Table 1: Modes of Development.

$S^{SUM}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} + S^2_{\vec{\alpha}\vec{x}} + S^3_{\vec{\alpha}\vec{x}} + S^4_{\vec{\alpha}\vec{x}}$;  competition: none;  property: unspecific
$S^{OD}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} + S^2_{\vec{\alpha}\vec{x}} - S^3_{\vec{\alpha}\vec{x}} - S^4_{\vec{\alpha}\vec{x}}$;  competition: left eye/right eye;  property: OD
$S^{OR+}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} - S^2_{\vec{\alpha}\vec{x}} + S^3_{\vec{\alpha}\vec{x}} - S^4_{\vec{\alpha}\vec{x}}$;  competition: ON/OFF;  property: OR+
$S^{OR-}_{\vec{\alpha}\vec{x}} = S^1_{\vec{\alpha}\vec{x}} - S^2_{\vec{\alpha}\vec{x}} - S^3_{\vec{\alpha}\vec{x}} + S^4_{\vec{\alpha}\vec{x}}$;  competition: ON/OFF;  property: OR−

Note: The superscripts 1 through 4 order the afferents as left-eye-ON, left-eye-OFF, right-eye-ON, and right-eye-OFF. The images in the original table show typical cortical receptive fields (arbor diameter $A_\varnothing = 13$ grid points) for each mode, for the left and right eye. For unspecific and OD, the sum of the ON and OFF afferents is shown, whereas for OR+ and OR− their difference is shown (gray = 0).
The second assumption is translation invariance due to the locally homogeneous structure of the neuronal layers:

$$A_{\vec{\alpha},\vec{x}} = A(\vec{\alpha} - \vec{x}), \qquad I_{\vec{x},\vec{y}} = I(\vec{x} - \vec{y}), \qquad C^{ij}_{\vec{\alpha},\vec{\beta}} = C^{ij}(\vec{\alpha} - \vec{\beta}). \tag{2.2}$$

Fourier transformation of equation 1.1 then yields

$$\frac{d}{dt} \hat{S}^i_{\vec{w},\vec{k}}(t) = \hat{A}_{\vec{w},\vec{k}} * \sum_j \hat{I}_{\vec{k}} \hat{C}^{ij}_{\vec{w}} \hat{S}^j_{\vec{w},\vec{k}}(t) - \gamma \hat{S}^i_{\vec{w},\vec{k}}(t) - \epsilon \hat{A}_{\vec{w},\vec{k}}, \tag{2.3}$$

where $\hat{\ }$ denotes the Fourier transform of the corresponding quantity and $*$ is the convolution operator. Diagonalization of $\hat{I}\hat{C}^{ij}$ yields a system that can easily be solved for its modes in the case of infinite arbors ($A_{\vec{\alpha},\vec{x}} = 1$). We obtain eigenvalues $\lambda^{SUM}_{\vec{w},\vec{k}}$, $\lambda^{OD}_{\vec{w},\vec{k}}$, $\lambda^{OR+}_{\vec{w},\vec{k}}$, and $\lambda^{OR-}_{\vec{w},\vec{k}}$, one corresponding to each mode in the table (Erwin & Miller, 1995; Piepenbrock, Ritter, & Obermayer, 1996). These eigenvalues depend on the input correlation functions $\hat{C}^{ij}_{\vec{w},\vec{k}}$ and the cortical interaction function $\hat{I}_{\vec{k}}$. The modes with the strongest eigenvalues will dominate the development process. The eigenvalue analysis for $A_{\vec{\alpha},\vec{x}} = 1$ gives a good approximation of a realistic system's behavior, since a synaptic density function $A$ with a finite radius mainly results in a smoothing in the spatial frequency domain due to the convolution but does not basically alter the principal mode. Based on these two assumptions, we can now derive conditions on the eigenvalues and, thereby indirectly, on the geniculate correlation functions and activity patterns. Most cortical receptive fields show ocular dominance, but they are unstructured with respect to input signals from the left and right eye (they have no left- or right-eye subfields). Consequently, for OD
development $\lambda^{OD}_{\vec{w},\vec{k}}$ must have its maximum at spatial frequency $\vec{w} = 0$ (Miller et al., 1989). Orientation selectivity in simple cells, on the other hand, is based on ON and OFF subfields that alternate at a typical spatial frequency within the receptive field (Miller, 1994). Therefore we require $\lambda^{OR+}_{\vec{w},\vec{k}}$ to have its peak at this typical spatial frequency $|\vec{w}|$. Hence, we make the ansatz:

$$C^{OD}_{\vec{\alpha}} = C^1_{\vec{\alpha}} + C^2_{\vec{\alpha}} - C^3_{\vec{\alpha}} - C^4_{\vec{\alpha}} = \frac{1}{\varphi^2} e^{-\vec{\alpha}^2/\varphi^2}$$

$$C^{OR+}_{\vec{\alpha}} = C^1_{\vec{\alpha}} - C^2_{\vec{\alpha}} + C^3_{\vec{\alpha}} - C^4_{\vec{\alpha}} = c\,\xi \left( \frac{1}{\sigma^2} e^{-\vec{\alpha}^2/\sigma^2} - \frac{1}{\sigma^2\nu^2} e^{-\vec{\alpha}^2/(\nu\sigma)^2} \right) \tag{2.4}$$

$$C^1_{\vec{\alpha}} = -C^4_{\vec{\alpha}}; \qquad C^2_{\vec{\alpha}} = -C^3_{\vec{\alpha}}$$

and we obtain

$$\lambda^{OD}_{\vec{w},\vec{k}} = \hat{I}_{\vec{k}} \hat{C}^{OD}_{\vec{w}}; \qquad \lambda^{OR+}_{\vec{w},\vec{k}} = \hat{I}_{\vec{k}} \hat{C}^{OR+}_{\vec{w}}; \qquad \lambda^{SUM}_{\vec{w},\vec{k}} = \lambda^{OR-}_{\vec{w},\vec{k}} = 0. \tag{2.5}$$
The parameters $\varphi$, $\sigma$, and $\nu$ denote the widths of the activity correlation functions, $\xi$ denotes the ratio of $\lambda^{OR+}$ and $\lambda^{OD}$, and $c$ is a normalization constant. This choice of $\lambda^{OR+}_{\vec{w},\vec{k}}$ has a maximum at spatial ON/OFF wavelength $\pi\sigma\sqrt{(\nu^2 - 1)/(2\ln\nu)}$ for any orientation. Preference for a particular orientation is then a result of symmetry breaking due to the localized arbor function. OR may develop in our model framework driven by either the OR+ or the OR− mode. For OR− the receptive fields of binocular cortical cells have ON and OFF subfields exchanged in the left and right eye (see Table 1), but the resulting orientation maps are equivalent in both cases. Thus we consider only OR+ development.

In a linear symmetric CBL model, OR and OD develop independently of each other. Equally strong development of OR and OD is not possible except at the infinitely sharp phase boundary. The structure of the phase boundary changes, however, as soon as nonlinearities are incorporated. Erwin and Miller (1995) recently presented simulations using appropriate nonlinearities and showed that a finite parameter regime emerges in which OR and OD codevelop. Let us investigate how this parameter regime varies with the choice of nonlinearities. In the purely linear case, the OR/OD phase transition can be expected for equally strong modes $\lambda^{OD}_{\vec{w},\vec{k}}$ and $\lambda^{OR+}_{\vec{w},\vec{k}}$, and we choose the parameters $c$, $\xi$ in equation 2.4 such that both eigenvalues have equal maxima for $\xi = 1$. This is the case for $c = \nu^{2/(\nu^2-1)}/(1 - \nu^{-2})$; that is, $\xi$ represents the ratio of the maxima of $\lambda^{OR+}$ and $\lambda^{OD}$. This type of correlation
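The stated wavelength follows from the two-dimensional Fourier transform $\int e^{-\vec{\alpha}^2/s^2} e^{-i\vec{w}\cdot\vec{\alpha}}\, d^2\alpha = \pi s^2 e^{-\vec{w}^2 s^2/4}$ applied to the difference of Gaussians in equation 2.4:

$$\hat{C}^{OR+}_{\vec{w}} \propto e^{-\vec{w}^2 \sigma^2/4} - e^{-\vec{w}^2 (\nu\sigma)^2/4}, \qquad \frac{d \hat{C}^{OR+}_{\vec{w}}}{d(\vec{w}^2)} = 0 \;\Rightarrow\; \vec{w}^2 = \frac{8 \ln \nu}{\sigma^2(\nu^2 - 1)},$$

which corresponds to the wavelength $2\pi/|\vec{w}| = \pi\sigma\sqrt{(\nu^2 - 1)/(2 \ln \nu)}$, independent of the orientation of $\vec{w}$.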
structure allows us to analyze the mode mixing of OR and OD in a simple setting. For $\xi \ll 1$ all neurons become monocular, and we expect the phase boundary at $\xi = 1$; for $\xi \gg 1$, binocular orientation-selective receptive fields develop.

We may biologically interpret the meaning of this eigenvalue structure. The development of OD requires positive activity correlations $C^{OD}$ between neurons (averaged over ON and OFF) that decrease with distance within one eye (Miller et al., 1989). OR, however, can develop only if the activities of ON and OFF neurons differ and $C^{OR+} \propto C^1 - C^2$ forms a Mexican hat–like correlation structure (Miller, 1994). Our assumption that $\lambda^{SUM}$ and $\lambda^{OR-}$ equal zero is biologically not very realistic. It means that the activity correlations within the eye ($C^1_{\vec{\alpha}}$ and $C^2_{\vec{\alpha}}$) have to be of the opposite sign of the correlations between the eyes ($C^4_{\vec{\alpha}}$ and $C^3_{\vec{\alpha}}$) (see equation 2.5). Such correlation functions may be introduced by inhibitory connections between the geniculate layers but are unlikely after birth and eye opening, when the left and right eye activities should be correlated. The assumption is not essential for a realistic development of receptive fields, however. Even for nonzero $\lambda^{SUM}$ and $\lambda^{OR-}$, the outcome of the development is basically unchanged as long as these modes do not become the leading modes and govern the development process (Erwin & Miller, 1995; Piepenbrock et al., 1996).

3 Simulations

To study the transition zone between OD and OR development, we start with a reduced model setup and add mechanisms one by one to analyze the effects of different nonlinearities separately. We systematically vary the following components of the model:

• Geniculate (lateral geniculate nucleus, LGN) neurons project to the cortex in topographic order. This projection is enforced by the synaptic density function $A_{\vec{\alpha},\vec{x}}$. In our implementation, it is simply chosen as 1 for $|\vec{x} - \vec{\alpha}| < \frac{1}{2}A_\varnothing$ and 0 else. To verify the approximation from equation 2.3, we also simulated a fully connected $A_{\vec{\alpha},\vec{x}} = 1$ network.

• Cortical neurons interact with each other. When they effectively excite each other at short distances and inhibit each other at larger distances, typical cortical maps develop for OD (with OD bands) and OR (with pinwheels). $I$ is given by

$$I_{\vec{x},\vec{y}} = e^{-(\vec{x} - \vec{y})^2/\mu^2} - \frac{1}{\zeta^2} e^{-(\vec{x} - \vec{y})^2/(\zeta\mu)^2} + \delta_{\vec{x},\vec{y}}. \tag{3.1}$$
We compared this with simple independent cortical neurons ($I_{\vec{x},\vec{y}} = \delta_{\vec{x},\vec{y}}$) to study the influence of cortical interaction on orientation selectivity.
• Synaptic weights should be positive or negative. Weights that become zero in our model have to be clipped by a hard nonlinear constraint. To study the behavior of the linear model, we also consider synaptic weights that can change their sign during development.

• Hebbian learning models require synaptic weight normalization. We use hard constraints that normalize the total weight at every time step and do not consider constraints over the projective field of LGN neurons. Subtractive and multiplicative constraints that keep the total synaptic weight constant are called S1 and M1, respectively. A multiplicative M2 constraint limits the total squared synaptic weight (Miller & MacKay, 1994). We consider two normalization approaches: total synaptic strength is constrained (1) for each cortical neuron separately or (2) over the whole cortex, a biologically less realistic model that (together with a multiplicative constraint) uniformly scales all synaptic weights but does not qualitatively alter the behavior of the unconstrained model.

All simulations of cortical OR and OD development are run with cortical neurons arranged on a 32 × 32 grid with periodic boundary conditions. The random initial weights are identical for all simulations ($A_{\vec{\alpha},\vec{x}}$ ± 20 percent uniformly distributed noise), and parameters are $A_\varnothing = 13$, $\varphi = 7.1$, $\sigma = 2.1$, $\nu = 3$, $\mu = 2$, and $\zeta = 3$. At each iteration step we (1) compute the unconstrained derivatives and (2) apply an S1 constraint (if desired); then we (3) add the derivatives to the synaptic weights $S^i_{\vec{\alpha},\vec{x}}$, (4) clip the weights (if desired), and (5) normalize the weights multiplicatively (using M1 or M2 constraints). The S1 constraint applies only to unclipped synapses and keeps the total synaptic weight constant. Nevertheless, some weights might get clipped in step 4, and a multiplicative M1 renormalization (step 5) becomes necessary (this is the update scheme from Miller, 1994). The update step size is chosen at the first iteration to change the initial weights by 1 percent on average. The simulations run for 1000 update steps, or until 95 percent of the synaptic weights reach the limits. The receptive fields are usually fully developed after 100 to 200 steps (average weight change less than 1/1000 percent relative to initial weights), and test simulations with up to 5000 steps near the phase boundaries did not show any further change in the synaptic strengths.
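This iteration can be summarized in a short sketch (ours, not the authors' code; the unconstrained derivative of step 1 is passed in precomputed, and the array layout is an illustrative assumption):

```python
import numpy as np

def update_step(S, dS, w_max, S_total):
    """One iteration of the constrained update, steps 1-5 of the text.

    S       : synaptic weights, shape (4, n_geniculate, n_cortex)
    dS      : unconstrained derivatives of equation 1.1 (step 1), same shape
    S_total : target total synaptic weight per cortical cell, shape (n_cortex,)
    """
    free = (S > 0.0) & (S < w_max)                   # synapses not saturated at a limit
    for x in range(S.shape[2]):                      # step 2: S1 constraint -- remove the
        f = free[:, :, x]                            # mean growth over unclipped synapses
        if f.any():
            dS[:, :, x][f] -= dS[:, :, x][f].mean()
    S = S + dS                                       # step 3: add the derivatives
    S = np.clip(S, 0.0, w_max)                       # step 4: clip at zero and at w_max
    for x in range(S.shape[2]):                      # step 5: M1 renormalization restores
        S[:, :, x] *= S_total[x] / S[:, :, x].sum()  # each cell's total synaptic weight
    return S
```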
For each map we computed three different values: OD, OR, and OS. The OD value is the root mean square (r.m.s.) eye dominance (Erwin & Miller, 1995), and the OR value is the r.m.s. ON/OFF subfield segregation value of the cortical neurons. Orientation-selective neurons need not only segregated ON and OFF subfields (high OR value) but also an arrangement of those subfields that yields orientation-specific responses (high OS value). The degree of orientation tuning is measured by the mean cortical orientation specificity OS (in the spirit of Miller, 1994):

$$OD = \sqrt{\frac{1}{N} \sum_{\vec{x}} \left( \frac{\sum_{\vec{\alpha}} S^{OD}_{\vec{\alpha},\vec{x}}}{\sum_{\vec{\alpha}} |S^{SUM}_{\vec{\alpha},\vec{x}}|} \right)^2} \tag{3.2}$$

$$OR = \sqrt{\frac{1}{N} \sum_{\vec{x}} \left( \frac{\sum_{\vec{\alpha}} S^{OR+}_{\vec{\alpha},\vec{x}}}{\sum_{\vec{\alpha}} |S^{SUM}_{\vec{\alpha},\vec{x}}|} \right)^2} \tag{3.3}$$

$$OS = \frac{1}{N} \sum_{\vec{x}} \frac{\sqrt{\left( \sum_{\vec{w} \neq 0} |\hat{S}^{OR+}_{\vec{w},\vec{x}}| \sin 2\varphi_{\vec{w}} \right)^2 + \left( \sum_{\vec{w} \neq 0} |\hat{S}^{OR+}_{\vec{w},\vec{x}}| \cos 2\varphi_{\vec{w}} \right)^2}}{\sum_{\vec{w} \neq 0} |\hat{S}^{OR+}_{\vec{w},\vec{x}}|}. \tag{3.4}$$

The OS value represents the sharpness of the neurons' orientation tuning curves, where each neuron is stimulated with wave patterns of different orientations $\varphi_{\vec{w}}$ at different spatial frequencies $\omega$ (which results in wave vectors $\vec{w} = \omega[\cos\varphi_{\vec{w}}, \sin\varphi_{\vec{w}}]^T$). The response to different wave patterns $\vec{w}$ in the linear model is given by the power spectrum of $\hat{S}^{OR+}_{\vec{w},\vec{x}}$, Fourier transformed along the LGN dimensions. OS values may range from 0 for cortical maps without any orientation preference up to OS = 1 for neurons that respond to exactly one orientation. Realistic values are generally 0.05 for no OS and about 0.1 to 0.3 for well-orientation-tuned neurons. The OS value alone is not sufficient as a measure of cortical orientation selectivity because it measures the width of the orientation tuning curve irrespective of its amplitude. For our simulation data, the orientation selectivity may be computed even if the cell's orientation response rate (OR value) is almost zero. This is biologically unrealistic, and therefore we report the OS values only to show that the simulation results with high OR value actually yield orientation-selective cells. In a biologically more realistic definition, OR and OS would have to be determined for each eye separately, weighted by the neurons' eye dominance values. This, however, is not necessary in our simulations because our model yields only cortical maps with identical orientation patterns in both eyes.
[Figure 1 comprises ten phase-diagram panels, plotting OD, OR, and OS against $\xi$ (abscissa from 0.1 to 10), for five model variants in rows: (1) no interaction, weights +/−, infinite arbor, M2 constraint, no weight limit; (2) weights +/−, local arbor, M2 constraint, no weight limit; (3) weights +, local arbor, M1 constraint, no upper weight limit; (4) weights +, local arbor, S1 constraint, upper weight limit; (5) weights +, local arbor, S1 constraint, upper weight limit, weight freezing. The left column shows results without, the right column with, cortical interaction.]
Figure 1: Phase diagrams of cortical OR and OD development. The parameter on the abscissa is $\xi$ (i.e., the ratio of the maxima of $\lambda^{OR+}_{\vec{w},\vec{k}}$ and $\lambda^{OD}_{\vec{w},\vec{k}}$). For infinite arbors the OD/OR phase transition is at $\xi = 1$. Convolving the localized $\hat{A}$ with $\lambda^{OR+}$ and $\lambda^{OD}$ results in a predicted phase transition at $\xi = 0.73$. The left column represents simulation results where neurons develop independently of each other (no cortical interaction), whereas results with cortical interaction (and the constraint enforced separately for each neuron) are shown on the right side.
if $\xi$ is exactly one. In this fully connected network, however, neurons may develop ocular dominance and ON/OFF receptive subfields, but the cells do not become orientation selective (low OS value) because the receptive fields have a large number of ON and OFF subfields in a random pattern. Orientation selectivity requires a certain ratio between the arbor diameter and the strongest-growing spatial frequency mode. In our case, the arbor function localizes synaptic growth and thus smooths the growing modes in the spatial frequency domain. This results in a shift of the phase boundary from $\xi = 1$ to a lower value. The system is still linear, and a joint development of OD and OR does not occur.

Now we introduce nonlinear mechanisms to study conditions for coupled OR and OD development. When we couple the development of neighboring neurons by introducing a cortical interaction function, we do not necessarily make the system nonlinear. If we enforce the constraint over the whole cortex (biologically not very realistic), the system remains linear, and we obtain graphs very similar to the ones shown in the left column (data not shown). Their orientation specificity, however, is slightly lower than in the case of independent cortical neurons (as has been noted by Miller, 1994) because the interactions force some cells to develop receptive fields that are not ideally tuned for orientation (e.g., around pinwheels). An M1 constraint simply scales all the weights, but if we apply it separately over each cortical receptive field (as for all results shown in Figure 1), then introducing an interaction function makes the system nonlinear. As a result we obtain a finite parameter domain in which OR and OD may develop concurrently.

Synaptic weights should not change their sign in a biologically realistic model. Therefore we have studied models where we clip the weights at zero. We compare the results for zero weight clipping with M1 (a scheme we have used in Piepenbrock et al., 1996) to zero and upper weight clipping with S1 constraints (used in Erwin & Miller, 1995). The two growing modes $S^{OD}$ and $S^{OR+}$ are both zero-sum modes; their growth does not alter the summed synaptic weight over each receptive field. Thus, under both M1 and S1 constraints, synaptic growth is initially unconstrained. Only when some synapses have reached their clipping values does the constraint begin to limit growth (Miller & MacKay, 1994). The only non-zero-sum modes are modes of $S^{SUM}$, so only if $S^{SUM}$ modes had nonzero eigenvalues would the constraint influence the development process from the beginning.

The graphs in the third row show simulation results for only positive synaptic weights. We use an M1 constraint that simply scales the weights (like an M2 constraint). The only significant difference between the second and the third row is the clipping of weights at zero. In the absence of intracortical connections (left column), this creates a finite parameter domain for joint OD/OR development. If the weights are allowed to change their sign, an M2 constraint is necessary to limit synaptic growth (second row). For weight clipping, however, an M1 (third row, left) and an M2 (data not
shown) constraint yield virtually identical results in the case of independent cortical neurons. Both constraints scale the synaptic weights but do not qualitatively alter the outcome of the unconstrained development, and the only nonlinearity is the weight clipping. The last two rows present the simulation results for subtractive constraints. In this case the weights have to be clipped at zero as well as at some maximum value (Miller & MacKay, 1994). Without maximum weight clipping, all but one synaptic weight within the receptive field would decay to zero. The size of the final receptive fields can be influenced by the choice of the maximum value, and we chose eight times the mean initial synaptic weight, as in Miller (1994). The difference between the last two rows is the treatment of synaptic weights that have saturated at zero or at the maximum value. If we limit the synaptic weights but allow them to change again later in the development process, we obtain the graph in the second-last row. The last row results if weights are frozen as soon as they reach the minimum or maximum value, as in Erwin and Miller (1995). OR and OD develop about equally strong near the phase boundary, and the amount of “randomness” in the initial synaptic weights determines how long it takes until the stronger mode begins to dominate the development process. If the weights are frozen at their limit values before this happens, we observe a joint development of OR and OD. On the other hand, if they are only clipped at the limits, the weights may change again later, and the influence of the initial conditions is reduced. This effect yields a slightly larger joint development parameter domain for the frozen synaptic weights in the bottom graphs. The simulations in the bottom-right graph use the same mechanisms as Erwin and Miller (1995) and almost identical parameters.1 It is not surprising that the last graph exhibits the largest parameter domain for joint OR and OD development since the simulation includes several nonlinear mechanisms— a minimum and a maximum synaptic weight value, as well as freezing of the weights that reach the limits. In the results shown in Figure 1, we varied the nonlinear mechanism but did not systematically change all the other model parameters. Our simulations (including simulations using the parameters from Erwin & Miller, 1995), however, indicate that the effects of the nonlinear mechanisms are robust against changes in ϕ, σ , ξ , ζ , and µ, and the parameter domains of joint OR and OD development remain qualitatively unchanged (i.e., the sharp phase transition of the purely linear CBL model changes into a parameter region of joint OR/OD development in a nonlinear model). This region is enlarged when more nonlinear mechanisms are added (as in Figure 1).
1 They report better orientation selectivity for smaller parameters $\sigma$ and for a tapered arbor function $A_{\vec{\alpha},\vec{x}}$. The effects of changes in the correlation functions on OR development are discussed in Miller (1994).
5 Summary

The main result of this note is that OR and OD decouple in the linear symmetric model, and a sharp phase boundary divides the parameter domains of OR and OD. Nonlinear mechanisms may create a transition region near the phase boundary where a joint development of OR and OD is possible. We have identified three such mechanisms: (1) normalizing synaptic weights separately for each cortical neuron with M1/M2 constraints, (2) clipping of synaptic weights, and (3) freezing of synaptic weights. The size of the transition region depends on the nonlinearities present in the model, and the transition zone is broadest if the mechanisms are applied simultaneously.

The basic assumption underlying CBL is that the modeled receptive field features develop early, when the developments of $S^{OD}$, $S^{OR+}$, and $S^{OR-}$ are essentially linear. This is true for $\xi \ll 1$ or $\xi \gg 1$, and all our simulations yield almost identical OD or OR maps. Close to the phase boundary, however, where OD and OR develop concurrently, the nonlinear mechanisms we have studied (M constraints, weight clipping, weight freezing) lead to different outcomes for identical initial conditions. Hence, the final receptive field properties are determined late, at a time when the nonlinearities set in to stabilize the receptive fields.

When discussing the biological significance of a model, it is very important to investigate carefully the effects of the different mechanisms involved. CBL models require constraints to limit synaptic growth, and for this reason we tested different types of constraints in our simulations. These constraints are only clumsy approximations of the real biological mechanisms, but the simulations give hints about which aspect of the constraining mechanism may be most important for the joint development of OR and OD. Multiplicative constraints implement synaptic decay that is proportional to the synaptic strength, whereas under subtractive constraints, all synapses decay at the same rate, irrespective of their strength. Furthermore, subtractive constraints require weight clipping not only at zero but also at some maximum value (Miller & MacKay, 1994). It is plausible that such constraints exist in biological neurons (Miller & MacKay, 1994), but their underlying mechanisms have not been identified yet. The simulation results show that the size of the parameter domain for joint OD/OR development depends more on weight clipping or freezing and on separate constraints for each neuron than on the choice of M1 or S1 constraints. Furthermore, this parameter domain is enlarged if different nonlinear mechanisms act together.

Note

After part of this work was completed, we learned that Erwin and Miller (1995) independently derived the diagonalization of equation 1.1. We thank Ken Miller for his review and discussions that helped to improve this paper.
Acknowledgments

This research was funded in part by DFG (grant Ob 102/2-1) and a scholarship from the Boehringer Ingelheim Fonds to C. P. Computing time (CM5) was made available by NCSA and HLRZ Jülich.

References

Erwin, E., & Miller, K. D. (1995). Modeling joint development of ocular dominance and orientation in primary visual cortex. In Proc. of the CNS-95.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation 7:425–468.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation selective cells. Proc. Natl. Acad. Sci. USA 83:8390–8394.
Miller, K. D. (1990). Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Computation 2:321–333.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangements of orientation columns through activity-dependent competition between ON- and OFF-center inputs. Journal of Neuroscience 14:409–441.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science 245:605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comp. 6:100–126.
Miyashita, M., & Tanaka, S. (1991). A mathematical model for the self-organization of orientation columns in visual cortex. Neuroreport 3:69–72.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). A statistical mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45:7568–7589.
Piepenbrock, C., Ritter, H., & Obermayer, K. (1996). Cortical map development driven by spontaneous retinal activity waves. In Artificial Neural Networks—ICANN 96. New York: Springer-Verlag.
Received September 15, 1995; accepted November 26, 1996.
NOTE
Communicated by Michael Shadlen
Physiological Gain Leads to High ISI Variability in a Simple Model of a Cortical Regular Spiking Cell Todd W. Troyer Keck Center for Integrative Neuroscience, Department of Physiology, University of California, San Francisco, San Francisco, CA 94143 USA
Kenneth D. Miller Keck Center for Integrative Neuroscience and Sloan Center for Theoretical Neurobiology, Departments of Physiology and Otolaryngology, University of California, San Francisco, San Francisco, CA 94143 USA
To understand the interspike interval (ISI) variability displayed by visual cortical neurons (Softky & Koch, 1993), it is critical to examine the dynamics of their neuronal integration, as well as the variability in their synaptic input current. Most previous models have focused on the latter factor. We match a simple integrate-and-fire model to the experimentally measured integrative properties of cortical regular spiking cells (McCormick, Connors, Lighthall, & Prince, 1985). After setting RC parameters, the postspike voltage reset is set to match experimental measurements of neuronal gain (obtained from in vitro plots of firing frequency versus injected current). Examination of the resulting model leads to an intuitive picture of neuronal integration that unifies the seemingly contradictory $1/\sqrt{N}$ and random walk pictures that have previously been proposed. When ISIs are dominated by postspike recovery, $1/\sqrt{N}$ arguments hold and spiking is regular; after the "memory" of the last spike becomes negligible, spike threshold crossing is caused by input variance around a steady state and spiking is Poisson. In integrate-and-fire neurons matched to cortical cell physiology, steady-state behavior is predominant, and ISIs are highly variable at all physiological firing rates and for a wide range of inhibitory and excitatory inputs.

1 Introduction

Softky and Koch (1993) have shown that the spike trains of cortical cells in visual areas V1 and MT display a high degree of variability, as measured by their interspike interval (ISI) distributions at fixed mean firing rates. Does this result require revising the classical notion of information transmission in cortical neurons? By "classical notion," we mean the view that neurons integrate many synaptic inputs until reaching some voltage threshold, and at that point produce a spike. Spikes are then followed by a refractory period,
during which the cell is less likely to produce another spike. If inputs are uncorrelated, the mean spike rate of such a model neuron depends on the average frequency of presynaptic events and thus can embody the standard notion of rate coding. Softky and Koch (1993) argued that high ISI variability is inconsistent with a neuron's acting as an EPSP integrator. The integration of many uncorrelated excitatory synaptic potentials (EPSPs) should result in a time to threshold (or ISI) that is very regular, since the neuron averages over many random events. One measure of spike variability is the coefficient of variation (CV) of the ISI distribution, defined as the standard deviation divided by the mean. For an EPSP integrator, this should be approximately $1/\sqrt{N}$, where $N$ is the number of events necessary to reach threshold. In contrast, cortical cells have CVs in the range .5 to 1, more closely resembling a random (i.e., Poisson) process for which CV = 1. As an alternative to simple integration, Softky and Koch (1993) proposed that active dendritic conductances may act to amplify submillisecond coincidences in synaptic input, leading to large random pulses of synaptic current. Thus, they suggested that high ISI variability may be more consistent with the notion (Abeles, 1982) of neurons acting as "coincidence detectors" rather than "rate encoders." Shadlen and Newsome (1994), following Gerstein and Mandelbrot (1964), argued that if uncorrelated synaptic inputs to a cell consist of balanced excitation and inhibition, then the membrane voltage follows a random walk, leading to ISIs that are quite variable. This balanced inhibition model is consistent with the classical notion of synaptic integration and rate coding but requires that excitation and inhibition be tightly balanced in order to achieve ISI variability consistent with the data.

These arguments can be more clearly understood by considering spike production as a two-step process. First, synaptic inputs are integrated by an extensive and complex dendritic tree, resulting in a total synaptic current I(t). Second, the cell produces spikes in response to this synaptic current. Thus, we expect that spike variability should depend on two factors: the sensitivity of the spike mechanism to changes in the synaptic current and the variability in this input current. Previous discussions of ISI variability have largely focused on the second term, that is, under what conditions sufficient input variability can be achieved. Thus, Softky and Koch (1993) argued that input current variability was too low to account for output spiking variability, while Shadlen and Newsome (1994) argued that a balance of excitation and inhibition could yield high input variability. Other models demonstrated that network dynamics can lead to correlations among synaptic inputs sufficient to yield high variability (Usher, Stemmler, Koch, & Olami, 1994; Hansel & Sompolinsky, 1996) (afferent inputs also have significant correlations; Alonso, Usrey, & Reid, 1996). However, these investigations generally did not address the sensitivity of cortical neurons. Neuronal sensitivity has been addressed by Bell, Mainen, Tsodyks, and Sejnowski (1995), who considered some of the
same factors as our work in the context of a Hodgkin-Huxley model.1 Our work differs in considering simple integrate-and-fire dynamics focusing on the roles of gain and refractoriness and, most important, in matching model parameters to experimental data from cortical neurons.

For the purposes of this article, we define neuronal gain as the slope of a plot of firing frequency $f = 1/\mathrm{ISI}$ versus the (constant) level of injected current $I$. Although there is no direct relationship between this measure and neuronal sensitivity under physiological conditions—neuronal gain is measured in the absence of input variability—we will argue that the two should be well correlated. Here we demonstrate that matching a simple integrate-and-fire model to experimentally measured values of neuronal gain can largely solve the conundrum of high ISI variability in visual cortical cells with many random synaptic inputs. A careful examination of this model leads to a relatively simple, intuitive picture of cortical neuronal integration that unifies the seemingly contradictory $1/\sqrt{N}$ and random walk pictures that have been previously proposed. A preliminary report of this study has appeared as an abstract (Troyer & Miller, 1995).

2 The High-Gain Model

In an integrate-and-fire model, the slower, integrative properties of neurons are modeled as passive changes in subthreshold membrane voltage. The fast spiking conductances are lumped into a stereotyped event that is "pasted onto" the voltage trace when the cell reaches threshold. After spiking, there is an absolute refractory period during which the cell cannot spike, followed by a relative refractory period during which the cell's ability to spike is reduced. We will use refractoriness to refer to the combined effect of all processes occurring after a spike that make a cell less likely to respond to a given pattern of input current with a subsequent spike. Refractoriness can be defined quantitatively as a function $r$ of the time $t$ after a given spike as follows. Suppose a just-threshold DC current is applied at or before the spike time, $t = 0$. The refractoriness, $r(t)$, is the magnitude of a brief current pulse that, superimposed on this ongoing DC current at time $t$, is just sufficient to elicit a spike.2 Note that this definition includes all factors that contribute to the cell's refractoriness, including sodium channel inactivation and the effects of other active conductances, as well as spike after-hyperpolarization.3

1 Bell et al. (1995) and Wilbur and Rinzel (1983) also investigated bistability in neuronal dynamics as a mechanism that could contribute to ISI variability.

2 Assume a spike occurs at $t = 0$, and that after an absolute refractory period $t_{refract}$ the voltage is reset to $V_{reset}$. Let $\tau$ be the membrane time constant, $g_{leak}$ the leak conductance, and $\Delta t$ the duration of the brief current pulse. Beginning from $V_{reset}$ at time $t = t_{refract}$, let $I_{t_0}$ be the DC current that leads to a spike at time $t_0$ (so $I_\infty$ is the just-threshold DC current). Let $V_{t_0}(t)$ be the time course of the voltage given $I_{t_0}$. Then $r(t) \equiv \frac{\tau g_{leak}}{\Delta t}\,(V_t(t) - V_\infty(t)) = (I_t - I_\infty)(1 - e^{-(t - t_{refract})/\tau})\,\frac{\tau}{\Delta t}$.
In the simplest models, refractory effects are modeled by waiting for a fixed, absolute refractory period after spike onset and then resetting the voltage to a value significantly below threshold. Thus, a single parameter, the depth of the after-spike voltage reset, serves to model all of the elements contributing to the cell's relative refractoriness.4 More realistic models of the complex events following a cortical spike may deepen our understanding of cortical integration, but such models are experimentally poorly constrained and can lose the clarity motivating the use of simple models. We use a simple conductance-based integrate-and-fire model:

$$C \frac{\partial V}{\partial t} = g_{leak}(V_{rest} - V) + I. \tag{2.1}$$
Parameters were matched to slice recordings of regular spiking cells (values from McCormick et al., 1985, noted in parentheses): $V_{rest} = -74$ ($-73.6 \pm 1.5$) mV; $1/g_{leak} = 40$ ($39.9 \pm 21.2$) MΩ; $C = \tau g_{leak}$, where $\tau = 20$ ($20.2 \pm 14.6$) msec.5 The absolute refractory period was taken to be the spike width, $t_{spike} = 1.75$ ($1.74 \pm .41$) msec. Spike threshold $V_{thresh}$ was set 20 mV above $V_{rest}$: $V_{thresh} = -54$ mV. This leaves the after-spike reset voltage, $V_{reset}$, as the only free parameter. This parameter is commonly set equal to $V_{rest}$, but this choice lacks physiological justification. We set $V_{reset}$ to match the model's f-I curve to that observed physiologically, as reported in McCormick et al. (1985) (see Figure 1A). That is, we set $V_{reset}$ to match the physiologically observed neuronal gain.6 This yields $V_{reset} = -60$ mV, 6 mV below threshold.

Using the f-I curve to set $V_{reset}$ points to the strong connection between refractoriness and neuronal gain. Once $\tau$ and $g_{leak}$ are determined from biophysical measurements, neuronal gain is determined by the cell's refractory behavior. A cell that is more refractory requires more input current to spike at any given frequency, resulting in an inverse relationship between refractoriness and neuronal gain.7
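For DC current the model's f-I curve has a closed form: from $V_{reset}$ the voltage relaxes toward $V_\infty = V_{rest} + I/g_{leak}$ with time constant $\tau$, so the ISI is $t_{spike} + \tau \ln[(V_\infty - V_{reset})/(V_\infty - V_{thresh})]$. The following sketch of the fitting logic is ours (the search loop, step sizes, and the 100 Hz operating point are illustrative choices):

```python
import numpy as np

V_rest, V_thresh, tau, t_spike = -74.0, -54.0, 20.0, 1.75   # mV, mV, ms, ms
g_leak = 1.0 / 40.0                                         # 1/(40 MOhm), in uS

def firing_rate(I, V_reset):
    """f-I curve of the integrate-and-fire model for DC current I (nA)."""
    V_inf = V_rest + I / g_leak                             # asymptotic voltage (mV)
    if V_inf <= V_thresh:
        return 0.0                                          # never reaches threshold
    isi = t_spike + tau * np.log((V_inf - V_reset) / (V_inf - V_thresh))
    return 1000.0 / isi                                     # spikes/s (ms -> s)

def gain_near_100hz(V_reset, dI=1e-3):
    I = 0.5
    while firing_rate(I, V_reset) < 100.0:                  # find the ~100 Hz point
        I += 0.01
    return (firing_rate(I + dI, V_reset)
            - firing_rate(I - dI, V_reset)) / (2 * dI)      # slope in Hz/nA

for V_reset in (-74.0, -60.0):                              # deep vs. shallow reset
    print(V_reset, gain_near_100hz(V_reset))                # shallower reset: higher gain
```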
Figure 1: (A) f-I plots for our high-gain model (solid line) and experiment (dashed line). For comparison, the dotted line represents the more common low-gain model, in which Vreset is set to Vrest . Experimental data shown are piecewise linear fits to data for first interspike interval from three cells (McCormick et al. 1985, Fig. 1). (B) ISI histogram from 10 sec of model data: λex = 7015 Hz; λin = 3263 Hz (R = .75); average firing rate = 98.4 Hz; CV = .6. Inset shows ISIs > 20 msec. (C) Voltage trace of first 200 msec of simulation in (B).
Similarly, greater refractoriness implies weaker sensitivity to physiological inputs. This explains why we connect high gain measured with constant currents to high sensitivity under physiological conditions: for a given membrane time constant and input resistance, weaker refractoriness implies both higher gain and higher physiological sensitivity.

Synaptic input took the form of Poisson-distributed delta function conductance changes, $I_{syn}(t) = \sum_k g_{syn}\,\delta(t - t_k)(V_{syn} - V)$, where $t_k$ denotes the arrival time of the $k$th presynaptic spike. $g_{syn}$ represents the total conductance, integrated over the time course of the synaptic event, and thus has units nS msec. $g_{syn}$ was taken from an exponential distribution with mean $\bar{g}_{syn}$; values greater than $4\bar{g}_{syn}$ were then reset to this maximum value.8 For excitatory synapses, $V_{ex} = 0$ mV and $\bar{g}_{ex} = 3.4$ nS msec. This yields EPSPs at rest of mean .49 mV and maximum 2 mV. For inhibitory synapses, $V_{in} = -70$ mV and $\bar{g}_{in} = 22.8$ nS msec. This yields inhibitory postsynaptic potentials (IPSPs) that are twice as large as EPSPs at threshold. The amount of inhibition was expressed as the ratio R of the mean inhibitory current to the mean excitatory current at threshold:

$$R = \frac{\lambda_{in}\,\bar{g}_{in}\,|V_{in} - V_{thresh}|}{\lambda_{ex}\,\bar{g}_{ex}\,|V_{ex} - V_{thresh}|}. \tag{2.2}$$

3 After-hyperpolarization—the voltage state of the cell—is often regarded as affecting integration rather than refractoriness, with "refractoriness" reserved for processes that alter the cell's spike responses at a given voltage, such as spike threshold elevation. When considering a cell's reduced responsiveness to input currents, however, it is most useful to combine all factors contributing to the reduction.

4 Alternatively, one may consider a two- (or three-) parameter model of refractoriness in which instead of (or in addition to) postspike voltage reset, there is a postspike elevation of threshold that decays at some rate (reviewed in Tuckwell, 1988, Table 3.1).

5 The term regular spiking denotes the class of cortical cells that respond to constant current injection in vitro with single-spike firing and spike-rate adaptation (McCormick et al., 1985). These constitute most cortical excitatory cells. The term does not imply regularity of firing in vivo.

6 Matching the slope of the model f-I curve at 100 Hz to the average gain reported by McCormick et al. (1985) for cortical regular firing cells gives $V_{reset} = -59.75$ mV.

7 Using the definitions of note 2, $r(t) = (I_t - I_\infty)(1 - e^{-(t - t_{refract})/\tau})\,\frac{\tau}{\Delta t}$. Writing gain $G(f)$ as a function of firing frequency $f$ ($G(f)$ is the slope of the f-I curve at $f$), this yields $r(t) = (1 - e^{-(t - t_{refract})/\tau})\,\frac{\tau}{\Delta t} \int_{f=0}^{f=1/t} df / G(f)$. Thus, high gain corresponds to low refractoriness.
Here $\lambda$ expresses the mean rate of Poisson input. Note that R = 1 implies that when voltage is clamped at $V_{thresh}$, the total synaptic current has mean zero; R < 1 corresponds to a surplus of excitatory synaptic current at threshold, while R > 1 corresponds to a surplus of synaptic inhibition. The leak current adds hyperpolarizing current. Estimates for $\lambda_{ex}$ and $\lambda_{in}$ rely on counting arguments only weakly constrained by experimental data (Shadlen & Newsome, 1994). Thus, to examine how model behavior depends on $\lambda_{ex}$ and $\lambda_{in}$, we fix R and then covary $\lambda_{ex}$ and $\lambda_{in}$ to yield varying output firing rates for that R. We have considered values of R ranging from 0 to 1.25; we will take R = .75 to be a reasonable guess for the ratio of inhibition to excitation in cortex.

Balanced inhibition models (Gerstein & Mandelbrot, 1964; Shadlen & Newsome, 1994) commonly assume equally likely positive and negative voltage steps, yielding a random walk in voltage. In a conductance-based model, this translates into a balance of inhibition and excitation sufficient to give a subthreshold mean current (so that there is not a steady drift up to near $V_{thresh}$). We will roughly equate balanced inhibition with R ≥ 1 (although smaller values are conceivable, since even for R = 1 the leak current maintains the mean voltage somewhat below threshold).9

3 Simulation Results

The results of a typical simulation (R = .75) are shown in Figure 1. Postsynaptic average firing rate was 98 Hz for 10 sec of simulated data. We measured variability by taking the CV of the ISI histogram, and found CV = .6, which falls in the physiological range of .5 to 1 reported in Softky and Koch (1993).

8 Thus the mean of the final distribution for $g_{syn}$ is $(1 - e^{-4})\,\bar{g}_{syn}$.

9 The exact distance below threshold depends on the relative magnitudes of the leak and mean synaptic conductances. In the limit of high input rates, the leak is negligible, and R = 1 corresponds to a mean synaptic current just sufficient to elicit depolarization to threshold.
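A minimal Euler-step rendering of this simulation is sketched below (ours, not the authors' code; the time step and seed are arbitrary, the absolute refractory period is approximated by suppressing threshold crossings for $t_{spike}$, and each delta conductance event is applied as an instantaneous jump $g_{syn}(V_{syn} - V)/C$):

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 0.05, 10_000.0                         # ms; 10 s of simulated time
V_rest, V_thresh, V_reset = -74.0, -54.0, -60.0
tau, t_spike = 20.0, 1.75                      # ms
C = tau * (1.0 / 40.0)                         # tau * g_leak = 0.5 nF
V_ex, V_in = 0.0, -70.0                        # reversal potentials (mV)
g_ex, g_in = 3.4e-3, 22.8e-3                   # mean conductances (uS ms)
lam_ex, lam_in = 7015.0, 3263.0                # input rates (Hz), R = .75

def draw_g(mean, n):                           # clipped exponential distribution
    return np.minimum(rng.exponential(mean, n), 4 * mean)

V, t_last, spikes = V_rest, -1e9, []
for step in range(int(T / dt)):
    t = step * dt
    V += dt / tau * (V_rest - V)               # leak
    for lam, g_mean, V_syn in ((lam_ex, g_ex, V_ex), (lam_in, g_in, V_in)):
        n = rng.poisson(lam * dt * 1e-3)       # synaptic events in this bin
        if n:
            V += draw_g(g_mean, n).sum() * (V_syn - V) / C
    if V >= V_thresh and t - t_last >= t_spike:
        spikes.append(t)                       # spike, then reset
        t_last, V = t, V_reset

isi = np.diff(spikes)
print(len(spikes) / (T * 1e-3), isi.std() / isi.mean())   # rate (Hz) and CV;
                                                          # the text reports 98 Hz, CV = .6
```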
What expectations do we have for CV in these simple models? We expect CV to be bounded above by the CV of a Poisson process with dead time equal to the minimum recovery period. With mean ISI $\mu$ and dead time $t_{dead}$, $CV_{max} = (\mu - t_{dead})/\mu$. For the simulation in Figure 1, $t_{dead} = t_{spike}$ gives $CV_{max} = .83$, while taking $t_{dead}$ equal to the shortest interval observed (2.75 msec) yields $CV_{max} = .73$.10 A rough lower bound can be obtained from the case of pure EPSP integration. In our high-gain model, it takes 16 average-sized EPSPs to reach threshold from reset (recall that EPSPs are reduced by three-fourths near threshold). Again accounting for dead time of $t_{dead} = t_{spike}$, a $1/\sqrt{N}$ argument yields $CV = .83/\sqrt{16} \approx .2$. With variable-sized PSPs, this estimate should be adjusted upward by a factor of roughly $\sqrt{2}$, yielding CV ≈ .28 (Softky & Koch, 1993; Stein, 1967). The reason that $1/\sqrt{N}$ arguments fail to estimate CV accurately will be discussed shortly.

We measured the dependence of CV on postsynaptic firing rate by varying input rates while keeping the inhibition-to-excitation ratio fixed at R = .75 (see Figure 2A). Physiologically reasonable levels of CV are obtained across firing rates. Also, CV decreases at higher firing rates, as observed experimentally (Softky & Koch, 1993, Fig. 3). To test the notion that a balance of excitation and inhibition is required to achieve high variability (Shadlen & Newsome, 1994; Bell et al., 1995), we varied the inhibition ratio R and observed ISI variability at input rates that yield postsynaptic firing rates between 95 and 105 Hz (see Figure 2B). To compare more closely with previous models that have used low gain, we also ran simulations with $V_{reset} = V_{rest}$. Balanced inhibition models are those with large R (e.g., R = 1.25),11 while the low-gain model with excitation only (R = 0, marked "O") is similar to the simple EPSP integrator discussed in Softky and Koch (1993). The model with physiological (high) gain gives physiological CV over a broad range of R. In contrast, balanced inhibition (R > 1) is necessary to achieve CV > .5 in the low-gain model.
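The upper bound is immediate to verify: a Poisson process with dead time has ISIs that are exponential variates shifted by $t_{dead}$, so the standard deviation equals $\mu - t_{dead}$ and CV $= (\mu - t_{dead})/\mu$. A short check (ours; the mean ISI of 10 msec matches the roughly 100 Hz operating point):

```python
import numpy as np

rng = np.random.default_rng(2)
t_dead, mu = 1.75, 10.0                        # dead time and mean ISI (ms)
isi = t_dead + rng.exponential(mu - t_dead, 100_000)
print(isi.std() / isi.mean())                  # ~ (mu - t_dead) / mu = 0.825
```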
10 For a Poisson process with dead time, the shortest interval, the dead time, and the peak of the ISI distribution are the same.

11 Shadlen and Newsome (1994) used $V_{reset} = V_{rest}$ and hence considered a low-gain balanced inhibition model.

4 An Intuitive Picture

An intuitive grasp of spike variability in simple integrate-and-fire models can be obtained by roughly dividing the ISI into three regimes (see Figure 3A; see also Abeles, 1991, Chap. 4; Smith, 1992). In the initial refractory regime, the state of the neuron is dominated by the recovery from the previous spike. In this regime $1/\sqrt{N}$ arguments are valid (Smith, 1992), and spiking is regular. In the final, steady-state regime, the memory of the last spike has decayed and random synaptic variation causes the cell to fluctuate about
Figure 2: (A) CV versus output firing rate, for fixed inhibition ratio R = 0.75. Data shown are mean ± standard deviation for 10 simulations (using different Poisson trains of inputs) of 10 sec duration. λex ranged from 3396 Hz to 20,509 Hz, while λin correspondingly varied from 1274 Hz to 7691 Hz. The thin line shows CV for Poisson process with dead time equal to the shortest ISI observed. With increasing firing rate, the refractory period is a greater fraction of a typical ISI. Thus CV decreases with increasing rate. This effect is also seen in the biological data (Softky & Koch, 1993, Fig. 3). (B) CV versus inhibition ratio R for high gain (solid line; Vreset = −60 mV) and low gain (dashed line; Vreset = Vrest ) models. R = 1.25 corresponds to example balanced inhibition models. Low gain with R = 0 corresponds to the EPSP integrator model (marked “O”). At a given level of R, values of λex and λin were found for which the 95 percent confidence interval for the mean spike rate was within the range from 95 to 105 Hz (10 simulations of 10 sec each). Data shown are from 10 subsequent trials at these input rates. λex ranged from 4448 Hz to 19,584 Hz for high gain and from 7500 Hz to 23,261 Hz for low gain; λin varied from 0 Hz to 12,240 Hz (high gain) and from 0 Hz to 14,538 Hz (low gain). The dotted line shows CV = .5.
some steady-state mean voltage (necessarily subthreshold). Thus, threshold crossings are equally likely to occur in any small interval, and spike statistics are Poisson. In the intermediate regime, the mean voltage is still significantly rising, yet typical voltage fluctuations can be sufficient to cross threshold. Spike variability in this regime should be a mixture of $1/\sqrt{N}$ integration effects and Poisson random crossings.

ISI variability depends on the relative amount of time spent in the regimes. As shown in Figure 3B, integration in the EPSP integrator model is primarily in the refractory regime, and hence spiking is regular. In the low-gain balanced-inhibition model, transient changes in the balance of excitation and inhibition are large enough to dominate the cell's refractoriness
Figure 3: Mean ± standard deviation of the distribution of voltages. (A) Schematic showing three regimes of behavior (see text). (B) Data obtained from repeated (n = 10,000) experiments starting the model (without spiking) from $V_{reset}$. Rates were set to achieve firing rates of 100 (±5) Hz. Solid lines are from the high-gain model ($V_{reset} = -60$ mV; R = .75; $\lambda_{ex} = 8885$ Hz; $\lambda_{in} = 3332$ Hz). Dashed and dotted lines are low-gain models ($V_{reset} = -74$ mV $= V_{rest}$). Dashed lines are from the EPSP integrator (R = 0; $\lambda_{ex} = 7500$ Hz; $\lambda_{in} = 0$ Hz). Dotted lines are from a low-gain, balanced-inhibition model (R = 1.25; $\lambda_{ex} = 23{,}261$ Hz; $\lambda_{in} = 14{,}538$ Hz). The arrow points to the mean ISI.
rather quickly, and the cell spends most of its time in the intermediate and steady-state regimes. Note that this recovery is aided by a reduction of the membrane time constant due to the large synaptic conductance. The small reset of the high-gain model causes the cell to spend most of the time in the intermediate or steady-state regimes for virtually any input combination that yields biologically reasonable firing rates.

Note that any mechanism that acts to reduce the importance of the transient, refractory regime contributes to variable spiking. In our model, $V_{reset}$ is the dominant parameter determining the significance of the refractory regime. However, reducing the membrane time constant $\tau$ shortens transients and can emphasize steady-state behavior. Increasing the variability of the synaptic current results in a more rapid transition to the intermediate regime and hence will also increase spike variability. Thus balanced inhibition, which reduces $\tau$ and increases input variability, and high-gain mechanisms are independent effects that can contribute to variable spiking.

It has been suggested (Bell et al., 1995) that spike variability depends on the cell's "hovering" near threshold, so that the integration of only a few inputs can cause a spike. While this is similar to our explanation of steady-state behavior, there is a significant difference. In the steady-state regime, threshold crossings result from random fluctuations in the input current,
so spiking is always Poisson, regardless of steady-state voltage level. Thus, the nearness to threshold in the steady-state regime affects spike rate but does not directly affect spike variability. 5 Discussion We have shown that matching simple integrate-and-fire neurons to experimental data results in ISI distributions with physiological CV. In additional studies, we have found that high CV is also produced by more realistic integrate-and-fire models incorporating the finite time course of synaptic conductances, realistic combinations of fast and slow excitatory synapses (AMPA and NMDA), and spike rate adaptation. Lengthening the synaptic time course can result in input correlations longer than the shortest ISIs and lead to burstlike behavior that can actually increase CV; a realistic mixture of fast and slow synapses avoids such burstiness while achieving high CV.12 Does an understanding of the intermediate and steady-state regimes have implications for neural coding? In these regimes, both the mean synaptic current and the variance in this current contribute to neuronal spiking. A neuron operating in these regimes is certainly capable of rate coding— coding mean input rates with a mean output rate. When inputs are Poisson, the mean input rate of excitatory and inhibitory events determines both the mean and the variance of the total synaptic current and consequently the mean output rate. However, such a neuron is also capable, in principle, of responding to input fluctuations with high temporal resolution. The possible contributions of these different forms of coding will depend on signal-to-noise considerations—how reliably a given statistical attribute of the input can be transmitted and distinguished from random input fluctuations—which will in turn depend on the details of the statistics of the neuronal input. Thus, the fact of high CV alone does not serve to distinguish between “rate coding” and “coincidence detection” or “spike timing” models of neural encoding. One difficulty with high-gain models is that they display low dynamic range; relatively small fluctuations in input current can lead to saturated outputs. In other studies (Troyer & Miller, 1995), we have found that spike rate adaptation helps to solve this problem by providing negative feedback that reduces gain on longer time scales, while preserving sensitivity to small input fluctuations on the time scale of typical interspike intervals. This fits well with the empirical observations that all excitatory cortical neurons show spike rate adaptation, while most inhibitory cortical neurons do not adapt but can sustain high rates of firing without saturating (McCormick et al., 1985).
12 The burstlike behavior gives an unrealistically low CV2, a measure of variability that compares only adjacent interspike intervals (Holt, Softky, Koch, & Douglas, 1996). The mix of fast and slow synapses gives a realistically high CV2.
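CV2 is straightforward to compute. A minimal sketch following the definition in Holt et al. (1996), CV2 = 2|ISIn+1 − ISIn|/(ISIn+1 + ISIn) averaged over adjacent interval pairs (the example trains below are ours, for illustration only):

    import numpy as np

    def cv2(isi):
        # Mean CV2: adjacent-interval variability, insensitive to slow rate drift.
        isi = np.asarray(isi, dtype=float)
        return np.mean(2.0 * np.abs(np.diff(isi)) / (isi[1:] + isi[:-1]))

    print(cv2([10, 10, 10, 10]))       # regular train -> 0.0
    print(cv2([2, 30, 3, 25, 2, 28]))  # burstlike alternation -> about 1.7 (max is 2)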
Our high-gain model results in ISI variability that falls in the lower half of the range 0.5 to 1 reported in Softky and Koch (1993). If one includes the refractory period, even a Poisson model cannot account for the upper reaches of these data. The most likely explanation is that cortical ISI variability results from a combination of mechanisms (including high gain) that have generally been explored in isolation. However, ISI variability alone is inadequate to distinguish the relative contributions of these mechanisms to cortical integration. Stronger constraints may be found from experiments designed to probe the biophysical basis of integration and spiking in cortical cells and from further statistical studies. The latter might examine, in vivo, the statistics of synaptically driven voltage and current fluctuations as well as spike train statistics beyond CV, and in vitro, the dependence of firing rate and CV on both mean and variance of white noise input current. One recent technique that may shed light on the contribution of cortical inhibition is the ability to block inhibitory input to a single cell pharmacologically (Nelson, Toth, Sheth, & Sur, 1994). To understand cortical ISI variability, classical notions of neuronal integration must be modified, not by abandoning the integrate-and-fire neuron but by deepening our understanding of the dynamics of such model neurons. In particular, the common intuition of dynamics dominated by recovery, with 1/√N integration behavior, must be supplemented by an understanding of a regime in which spike behavior is Poisson. Such a regime has been well studied in voltage-based integrate-and-fire models using a variety of statistical approaches (Gerstein & Mandelbrot, 1964; Stein, 1967; Smith, 1992; Tuckwell, 1988, Chap. 9); in certain limits the regime can be described as an unbiased random walk of voltage with decay. The importance of such a random, steady-state regime for cortical processing has been discussed by several previous authors (Abeles, 1991; Shadlen & Newsome, 1994; Bell et al., 1995). Our contributions are to make specific links between refractoriness and neuronal gain and to demonstrate that an integrate-and-fire model fitted to known cortical physiology will operate in the intermediate and steady-state regimes for physiologically reasonable firing rates and a wide range of ratios of inhibition to excitation. In these regimes, variable spike statistics dominate, and hence the model produces physiological CV. Thus, in contrast to the intuition presented in Softky and Koch (1993), one should expect a high degree of ISI variability from cortical neurons. Of course, a full biophysical understanding of cortical integration may yet require abandonment of this simple view of cortical processing, but the fact of cortical ISI variability alone does not. Acknowledgments We thank Larry Abbott, Paul Bush, Mark Kvale, Anton Krukowski, Neal Hessler, and Svilen Tzonev for helpful comments on this article. This work
was supported by an NSF Mathematical Sciences Postdoctoral Research Fellowship (T.W.T.) and by a Whitaker Foundation Biomedical Engineering Research Grant, the Searle Scholars Program, and the Lucille P. Markey Charitable Trust (K.D.M.).

References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Bell, A. J., Mainen, Z. F., Tsodyks, M., & Sejnowski, T. J. (1995). Balanced conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego, CA: Institute for Neural Computation, University of California at San Diego.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comput. Neurosci., 3, 7–34.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). A comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75(5), 1806–1814.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–805.
Nelson, S., Toth, L., Sheth, B., & Sur, M. (1994). Orientation selectivity of cortical neurons during intracellular blockade of inhibition. Science, 265, 774–777.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Smith, C. E. (1992). A heuristic approach to stochastic models of single neurons. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 561–588). San Diego, CA: Academic Press.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophys. J., 7, 797–826.
Troyer, T. W., & Miller, K. D. (1995). A simple model of cortical excitatory cells linking NMDA-mediated currents, ISI variability, spike repolarization and slow AHPs. Soc. Neuro. Abs., 21, 1653.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. J. Theor. Biol., 105, 345–368.
Received January 12, 1996; accepted July 29, 1996.
Communicated by Anthony Bell
Role of Temporal Integration and Fluctuation Detection in the Highly Irregular Firing of a Leaky Integrator Neuron Model with Partial Reset Guido Bugmann School of Computing, University of Plymouth, Plymouth PL4 8AA, UK
Chris Christodoulou Department of Electronic and Electrical Engineering, King’s College London, Strand, London WC2R 2LS, UK
John G. Taylor Department of Mathematics, King’s College London, Strand, London WC2R 2LS, UK Institut für Medizin, KFA Jülich, Jülich, Germany
Partial reset is a simple and powerful tool for controlling the irregularity of spike trains fired by a leaky integrator neuron model with random inputs. In particular, a single neuron model with a realistic membrane time constant of 10 ms can reproduce the highly irregular firing of cortical neurons reported by Softky and Koch (1993). In this article, the mechanisms by which partial reset affects the firing pattern are investigated. It is shown theoretically that partial reset is equivalent to the use of a time-dependent threshold, similar to a technique proposed by Wilbur and Rinzel (1983) to produce high irregularity. This equivalent model allows establishing that temporal integration and fluctuation detection can coexist and cooperate to cause highly irregular firing. This study also reveals that reverse correlation curves cannot be used reliably to assess the causes of firing. For instance, they do not reveal temporal integration when it takes place. Further, the peak near time zero does not always indicate coincidence detection. An alternative qualitative method is proposed here for that latter purpose. Finally, it is noted that as the reset becomes weaker, the firing pattern shows a progressive transition from regular firing, to random, to temporally clustered, and eventually to bursting firing. Concurrently, the slope of the transfer function increases. Thus, simulations suggest a correlation between high gain and highly irregular firing. 1 Introduction It has been reported by Softky and Koch (1992, 1993) and Bugmann (1990) that the highly irregular firing of cortical neurons cannot be reproduced by a single neuron performing the temporal integration of excitatory
Neural Computation 9, 985–1000 (1997)
© 1997 Massachusetts Institute of Technology
postsynaptic potentials (EPSPs) generated by independent stochastic input spike trains. They found that only simulations with biologically unrealistic short membrane integration time constants of RC < 1 ms allowed reproduction of the observed irregularity. While irregular firing at low frequency is commonly obtainable with long membrane time constants (see, e.g., Bugmann, 1991a), the specific problem posed by the data of Softky and Koch (1992, 1993) was the high irregularity at high frequency. This result has triggered investigations into alternative ways of producing irregular spike trains, for example, by exploiting dynamic properties of networks of spiking neurons (Usher, Stemmler, Koch, & Olami, 1994; Zipser, Kehoe, Littlewort, & Fuster, 1993) or exploiting balanced excitatory and inhibitory inputs (Shadlen & Newsome, 1994; Bell, Mainen, Tsodyks, & Sejnowski, 1995). Recently Bugmann and Taylor (1994) and Bell et al. (1995) have provided reminders that incomplete repolarization (or partial reset) of the membrane causes highly irregular firing (for older references to that approach, see Lansky & Musila, 1991). The combination of balanced excitation and inhibition and partial reset was advocated by Tsodyks and Sejnowski (1995). So far, however, it has not been shown that any of these mechanisms reproduces the dependence of irregularity on firing rate shown by Softky and Koch (1993). This article concentrates on the partial reset technique and demonstrates that it enables a single neuron with a realistic membrane time constant of 10 ms to reproduce the highly irregular firing of cortical neurons. It is shown theoretically that partial reset is equivalent to the use of a time-dependent threshold, similar to a technique proposed by Wilbur and Rinzel (1983) for producing high irregularity. This equivalent model allows establishing that temporal integration and fluctuation detection can coexist and cause irregular firing. It is shown that, on the one hand, the potential of the membrane results from the progressive accumulation of input currents, while on the other hand, current peaks are the events causing firing, because the firing threshold remains slightly above the rising potential. Reverse correlation curves are analyzed for various strengths of reset in an attempt to establish the causes of firing. However, it is found that they do not reveal temporal integration, and peaks near time zero are not reliable indicators of coincidence detection. An alternative qualitative method is proposed for that latter purpose. 2 Partial Reset and the Control of the Firing Irregularity Partial reset is a mechanism by which an output spike does not completely reset the membrane potential of a neuron model (Shigematsu, Akiyama, & Matsumoto, 1992). In our simulations of a leaky integrate-and-fire (LIF) neuron (see a description of that model in Softky & Koch, 1993, and in appendix A to this article), an output spike fired at time t resets the potential of the capacitor to V(t) = βVth, where Vth is the firing threshold and β is
the reset parameter, with a value between 0 and 1. By comparing the results of Rospars and Lansky (1993) and of Christodoulou, Clarkson, Bugmann, and Taylor (1994), who found CV > 1 when no resetting was used, with the results of Softky and Koch (1993), who found CV < 1 for total reset (i.e., β = 0), it was inferred that partial reset may allow a fine control of the irregularity of the spike trains. This is confirmed in Figure 1, which illustrates that CV can be varied over a wide range by varying the reset parameter β. Partial reset in a LIF neuron most likely corresponds to partial repolarization of the somatic membrane in more detailed models (Bell et al., 1995), but it could also reflect a true but incomplete electrical decoupling between the axonal trigger zone and the neuronal soma (Bras, Gogan, & Tyc-Dumont, 1988) or between soma and dendrites (Rospars & Lansky, 1993), or be merely a computationally simple way to reproduce the effects of more complex dynamical properties of the membrane (see, e.g., Carpenter, 1979; Lansky & Musila, 1991). 3 Equivalence Between Partial Reset and Time-Varying Threshold A LIF neuron with constant threshold and partial reset is equivalent to a LIF neuron with total reset and a time-varying threshold. This is briefly demonstrated analytically in appendix A and is illustrated in Figure 2. After a spike is produced, partial reset initially sets the potential V(t) of the capacitor to βVth. If there is no subsequent input, the potential decays according to the curve Vo(t). If there are input spikes, the potential increase due to the inputs builds up on top of the decaying Vo(t) potential. The resulting total potential V(t) is overall relatively constant, and firing appears to be controlled by maxima of potential fluctuations, usually attributed to coincidences of input spikes. Another picture emerges when the potential Vo(t) is subtracted from V(t) and from the threshold, an operation that, according to the theory, results in an equivalent model. In the equivalent model, the potential V′(t) of the soma now shows the usual integration-and-reset cycle seen in LIF models with total reset. In contrast to the constant threshold used in the classical LIF model, the threshold now increases with time, according to Vth(t) = Vth − Vo(t) = Vth(1 − β exp(−t/RC)). In Figure 2, this threshold stays closely above the membrane potential while both increase almost in parallel. 4 Determinants of the Firing Time How is the firing time determined in a LIF model with partial reset? Of special interest is a case of high irregularity and high frequency, for instance, the point T = 10 ms and CV = 0.87 on the curve for β = 0.91 in Figure 1, which is characteristic of the biological data and could not be reproduced in simulations by Softky and Koch (1993). For comparison, points with T =
Figure 1: Dependence of the coefficients of variation of the interspike intervals CV on the average interspike interval T, for three values of the reset parameter β. The symbols show the effects of partial reset on the irregularity of spike trains produced by an LIF neuron simulated with 1 ms integration time steps. The coefficients of variation obtained with β = 0.91 (square symbols) are very similar to those observed in cortical neurons (Figure 9 in Softky & Koch, 1993). Each point is based on 20,000 simulation time steps. As inputs, we used 50 random spike trains without refractory time. All inputs were excitatory and had the same average firing frequency, increased step by step from 150 Hz to 400 Hz, to vary the average output interspike interval T. The membrane potential was increased stepwise by 0.16 mV by each input spike, assumed to be of zero width. Between spikes, the potential decreased with a decay time constant of 10 ms (13 ms in Softky & Koch, 1993). During the refractory time, integration of inputs continued, but the value of the potential was not compared with the firing threshold Vth = 15 mV, as in Christodoulou et al. (1994). The full line shows the theoretical curve for a random spike train with discrete time steps Δt and a refractory time of Tr time steps: CV(T) = √(σ²(T)/T²) = √(1 − α/(1 + αTr)). This expression was developed assuming that in each time step, except the Tr time steps following a spike, there is a probability α of observing a spike (Bugmann, 1995). For the plotted curve and the simulations, we used Δt = 1 ms and Tr = 1 ms (causing a minimum interspike interval Tr + Δt = 2 ms). The actual average interspike interval T is related to α by T = (1 + αTr)/α.
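The simulation protocol in this caption is easy to reproduce. The sketch below follows the stated conditions (50 excitatory Poisson inputs, 0.16 mV per input spike, 10 ms decay, Vth = 15 mV, reset to βVth, 1 ms steps, and a one-step refractory time during which inputs are integrated but the threshold is not tested); the input rate of 250 Hz per train and the random seed are our choices, so exact values will vary:

    import numpy as np

    rng = np.random.default_rng(0)

    def lif_partial_reset(beta, f_in_hz, n_inputs=50, dt=1e-3, tau=10e-3,
                          v_th=15.0, epsp=0.16, t_ref=1, n_steps=200_000):
        """Spike times (in steps) of an LIF neuron with partial reset."""
        decay = np.exp(-dt / tau)
        v, refrac, spikes = 0.0, 0, []
        for t in range(n_steps):
            v = v * decay + epsp * rng.binomial(n_inputs, f_in_hz * dt)
            if refrac > 0:
                refrac -= 1          # integrate inputs; skip the threshold test
            elif v >= v_th:
                spikes.append(t)
                v = beta * v_th      # partial reset to beta * Vth
                refrac = t_ref
        return np.array(spikes)

    isi = np.diff(lif_partial_reset(beta=0.91, f_in_hz=250.0))  # ISIs in ms
    print(f"mean ISI = {isi.mean():.1f} ms, CV = {isi.std() / isi.mean():.2f}")

Sweeping β from 0 toward 1 at a fixed output rate reproduces the qualitative trend of the figure: CV grows as the reset weakens.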
10 ms on the curves for the two other values of β are also discussed. It can be seen in Figure 2 that the potential V′(t) of the membrane (of the equivalent model) increases progressively, which is, in the commonly accepted sense, the result of the temporal integration of past input spikes. However, due to the small distance between the firing threshold and the potential, potential
Figure 2: Membrane potentials of the LIF model with partial reset (A, gray lines) and its equivalent model with total reset (B, black lines). The curve V(t) is observed in the case β = 0.91 with an average interspike interval of Tf = 10 ms. The simulation is done assuming that a spike has just been fired so that the initial potential is βVth. The output spike train is very irregular and has a CV = 0.87. The curve Vc(t) is obtained with a continuous input current equal to the average input current in the case of the curve V(t). The time taken to reach the threshold is now Tc = 17 ms. This shows that a fluctuating input current leads to higher firing rates. For comparison, in the case β = 0.98 one finds Tc = ∞ but Tf = 10 ms, and in the case β = 0, Tc = Tf = 10 ms. The curves Vo(t) and Vth(t) are explained in the text. The curve V′(t) is calculated from V(t) and Vo(t). The input current corresponds to the total number of input spikes in each time step. The simulation time step is 0.1 ms.
fluctuations are more likely to cause early firing and can thus become the actual causes of firing. The importance of potential fluctuations for the firing behavior of the neuron is reflected in the fact that the average interspike interval with a fluctuating input, Tf , is shorter than the interspike interval Tc observed with a continuous input current of the same average value.1 For instance, in the case β = 0, the fluctuations seem to play no role, as Tf = Tc = 10 ms. In the case β = 0.98, when Tf = 10 ms, the potential cannot reach the firing threshold with the continuous input current (Tc = ∞); therefore, fluctuations are essential for firing. In the case β = 0.91, the fluctuations play a significant role, as most spikes are produced early (Tc = 17 ms and Tf = 10 ms). Therefore, although the potential of the neuron is the result of a standard temporal integration process, the fluctuations of that potential determine the time of firing due to the partial reset mechanism or, equivalently, the use of a time-dependent threshold. 5 What Do Reverse Correlation Graphs Tell Us? What is the time scale over which input current fluctuations (fluctuations of the density of input spikes along the time axis) are integrated to result in potential fluctuations that cause firing? To answer such a question, one usually turns to reverse correlation graphs, which can “indicate the length of stimulus history that is relevant” (Mainen & Sejnowski, 1995). This interpretation of reverse correlation curves requires some caution (see also Shadlen & Newsome, 1995). For instance, in Figure 3, the curve for β = 0 is almost perfectly flat for times earlier than −0.5 ms, although it is established that in this case, every input spike arriving during an average of 10 ms before firing is relevant, as firing results from temporal integration of small EPSPs (Softky & Koch, 1992, 1993). An indication of the length of relevant stimulus history is provided only if the temporal organization of the inputs is important for firing. In the case β = 0, the timing is not important, and the graph does not show any feature, even if spikes are important for the firing process. In the curve for β = 0, the peak at time zero of amplitude 611 Hz is due to the fact that a LIF neuron can fire only when the instantaneous input current is above current threshold (Bugmann, 1991b). In models with EPSPs extended in time, for example, modeled with alpha functions (Bugmann, 1995), such a current level results from inputs having occurred some time in the past. However, in the model used here, where spikes cause an instantaneous increase in potential (of 0.16 mV), at least one spike must be present during a time step for firing to be possible during that time step. Therefore, the average input firing rate measured by the reverse correlation contains
1 Each input spike deposits a fixed electric charge ΔQ = V1 ∗ C (see appendix A) in the capacitor, so that a spike train of frequency f “carries” an average current IAV = f ∗ ΔQ.
Figure 3: Reverse correlation graphs for the three indicated values of the reset parameter β. The average interspike interval is T = 10 ms in all three cases. The graphs represent input firing frequencies based on number of input spikes observed in 0.1-ms time steps, at the indicated delay before and after a spike is produced. The frequencies stay at the same levels on both sides up to +15 ms and −15 ms. The fine lines indicate the level of the average input frequencies. Lower average input firing rates are needed as β increases (295 Hz for β = 0, 189 Hz for β = 0.91, and 178 Hz for β = 0.98) because partial reset removes fewer charges after each spike than total reset does. Amplitude of the peak at time zero: 611 Hz for β = 0, 546 Hz for β = 0.91, and 455 Hz for β = 0.98. Data are averaged over 10,000 output spikes (100 sec simulated operation). Decay time constant RC = 10 ms.
only time steps with one or more spikes for the bin at time zero. This results in large values of the input firing rate. For instance, if one uses only time steps with one or more spikes, the calculated frequency is 380 Hz, while the average input frequency is 295 Hz. If only time steps with two or more spikes are used, the calculations give 673 Hz. Thus, the amplitude of 611 Hz of the central peak indicates that firing mostly occurs during time steps with two or more input spikes. This sampling effect results in even higher peaks at time zero, when smaller simulation time steps are used. This effect, added to the fact that the central peak is observed for all three values of β, strongly suggests that the central peak does not have any computational meaning. For instance, for β = 0, temporal integration determines the time of firing, while in the case β = 0.98, potential fluctuations determine the time of firing.
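This sampling effect can be checked directly. Assuming the pooled input count per 0.1 ms bin is Poisson (a simplification; the article uses 50 discrete spike trains), conditioning on bins that contain at least one spike inflates the apparent per-input rate roughly as quoted; the k = 2 value depends on the details of the measurement and need not match the 673 Hz reported:

    import numpy as np

    rng = np.random.default_rng(1)

    def apparent_rate(rate_hz, n_inputs=50, dt=1e-4, k=1, n_bins=1_000_000):
        """Mean per-input rate computed only from bins with at least k spikes."""
        counts = rng.poisson(rate_hz * n_inputs * dt, n_bins)
        return counts[counts >= k].mean() / (n_inputs * dt)

    print("true mean rate: 295 Hz")
    for k in (1, 2):
        print(f"bins with >= {k} spike(s): {apparent_rate(295.0, k=k):.0f} Hz")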
In the curve for β = 0, in the three to four time steps before zero, there is a small accumulation of spikes, which indicates that, on average, a spike is produced just after a small peak in input current (due to a higher-than-average arrival rate of input spikes) lasting approximately 0.5 ms. In the case β = 0, such an accumulation cannot be taken as evidence that the neuron is technically a coincidence detector in the sense that only coincidences within a small time window of 0.5 ms cause firing. It merely reflects the fact that in the case of a fluctuating input current, the potential of an LIF neuron usually crosses the threshold during a positive fluctuation. This fluctuation does not need to be large in the case β = 0, where the saturation potential for an equivalent continuous input current is 24 mV and the potential is rising steeply when crossing the firing threshold. Actually, the comparison of Tc with Tf done earlier shows that fluctuations are not needed at all, because the neuron would fire at the same rate when fed with an average input current. In the case β = 0.91, the saturation potential is slightly above the threshold, but at time T = 10 ms, the average potential is still below threshold at 14.4 mV (starting with a reset potential βVth = 13.65 mV). Thus potential fluctuations of sufficient amplitude are required to cause the observed early firing. This is reflected in the reverse correlation curve, where the accumulation before time zero is of larger amplitude and longer duration than in the case β = 0, indicating that firing is usually preceded by current peaks lasting up to 2 ms.2 In this case, where the input spikes arrive at an average rate of 0.95 per bin (0.1 ms), current peaks are not characterized by a clean current plateau of given duration. They are probably best described as clusters of spikes, two of which can be seen in Figure 2 before each of the two output spikes. In the case of partial reset, individual spikes have a potentially higher chance of causing firing because the membrane potential is close to the firing threshold. However, the width of the reverse correlation curve indicates that clusters of various durations contribute to the firing. Thus, partial reset is not equivalent to the use of the giant EPSPs of over 10 mV proposed in Softky and Koch (1993). The earlier analysis of the firing times indicates that this neuron probably has a coincidence detection
function, where a coincidence is any cluster containing a sufficient number of spikes in a given time window. However, the reverse correlation curve cannot confirm this hypothesis. As seen in the case β = 0, the fact that firing usually occurs after a peak of above average input current cannot be taken as evidence that the neuron is “detecting” this type of peak. In the case β = 0, many other high current peaks are occurring at random times before firing, do not cause firing, and do not appear in the reverse correlation curve.
6 Proving Coincidence Detection To prove formally that the neuron operates as a coincidence detector, it has to be shown that most current peaks of the same size as those seen before a spike actually cause a spike. A theoretical proof is a complex task beyond the scope of this article. However, it can be checked by simulation whether current peaks systematically cause firing. For that purpose, simulations were performed with two output neurons receiving the same input spikes, the first being the LIF neuron with partial reset and the second acting as a peak detector. The second LIF neuron has a very short decay time constant of 2 ms (this duration is taken from the reverse correlation curve for β = 0.91), and its potential is not reset after a spike. Due to the fast decay, potential peaks reflect the above-average rate of arrival of input spikes during a preceding time window of up to 2 ms. By placing the firing threshold of this neuron appropriately (3.8 mV), its spikes become indicators of peaks in input current. Figure 4 shows that the two neurons often fire spikes at the same time or within 1–2 ms of each other. Although this is not a formal proof, it strongly suggests that the LIF neuron with partial reset β = 0.91 operates mainly as a detector of peaks of input current of duration less than 2 ms. Because such events are random, this property is the cause of values of CV close to one. This small time window of 2 ms and the similarity between the fluctuating portions of the two potential curves shown as full lines (see Figure 4) suggest that the neuron may operate with a small effective time constant. This is, however, not the case. There is no inhibitory leak in this model that could have such an effect (Softky, 1995). Further, it is shown in appendix B that when the potential is maintained close to its saturation value by a continuous input current, a small increase in potential decays toward the saturation potential with a time constant of 10 ms. Two factors cause the similarity between the two curves: (1) The model uses a step increase of the potential for each input spike, and so there is no difference in the rising phases in both neurons; and (2) the reset cuts through the maxima in the potential of the capacitor with RC = 10 ms and moves the decreasing parts of the potential below the threshold. These are of similar amplitude in both curves, indicating a similar leak current in both neurons. This is due to the difference in average potentials, despite different decay time constants.
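A sketch of this two-neuron comparison follows. The shared drive (50 inputs at an illustrative 200 Hz each), the absence of a refractory period, and the registering of a detector event at each upward threshold crossing (since the detector is never reset) are our assumptions; the thresholds and time constants are those given in the text:

    import numpy as np

    rng = np.random.default_rng(2)

    dt, n_steps = 1e-4, 300_000                 # 0.1 ms steps, 30 s
    tau1, tau2 = 10e-3, 2e-3                    # integrator and peak detector
    vth1, vth2, beta = 15.0, 3.8, 0.91
    d1, d2 = np.exp(-dt / tau1), np.exp(-dt / tau2)
    drive = 0.16 * rng.binomial(50, 200.0 * dt, n_steps)  # shared EPSPs (mV)

    v1 = v2 = 0.0
    above = False
    s1, s2 = [], []
    for t in range(n_steps):
        v1 = v1 * d1 + drive[t]
        v2 = v2 * d2 + drive[t]
        if v1 >= vth1:
            s1.append(t)
            v1 = beta * vth1                    # partial reset
        crossing = v2 >= vth2
        if crossing and not above:
            s2.append(t)                        # detector event: upward crossing
        above = crossing

    s1, s2 = np.array(s1), np.array(s2)
    idx = np.clip(np.searchsorted(s2, s1), 1, len(s2) - 1)
    nearest = np.minimum(np.abs(s1 - s2[idx - 1]), np.abs(s1 - s2[idx]))
    print(f"{np.mean(nearest <= 20):.0%} of spikes have a detector event within 2 ms")

With these settings, most partial-reset spikes should be accompanied by a detector event, consistent with Figure 4.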
Figure 4: Membrane potentials and firing times of two neurons with the same input spike trains, simulated during 300 ms. The upper row of spike times and the upper membrane potential (in the full line) belong to a LIF neuron with partial reset (β = 0.91) and a membrane time constant of RC = 10 ms. The firing threshold is Vth = 15 mV. The dotted line represents the free evolution of the potential with neither spiking nor resetting. The lower row of spike times and the lower membrane potential belong to a LIF neuron with a membrane decay time constant of RC = 2 ms and a firing threshold Vth = 3.8 mV. This neuron is not reset after a spike and is used to detect peaks of input current or, equivalently, peaks of spike input rate. Further simulation conditions are as in Figure 2.
7 Temporally Clustered Firing and Neuronal Gain Values of CV larger than one are caused by a temporal clustering process (e.g., bursting) that results from a smooth transition from the peak detection regime described above into a regime where multiple spikes are generated in response to peaks (single responses to peaks can produce values of CV only up to 1). As an indicator of clustering, there is an interesting peak at approximately −2 ms in the reverse correlation curve for β = 0.98 (see Figure 3). The interpretation of this peak requires several logical steps. To start, it is not sensible to take this peak as evidence that an input current peak occurs frequently 2 ms before a spike is fired. More plausible is the assumption that the peak has caused a first spike followed by a second one 2 ms later. There is no current peak at t = 2 ms in the reverse correlation curve, indicating that the first spike is generally not followed by an increased input current 2 ms later. Thus the accumulation of current just before time zero and the accumulation preceding the spike at −2 ms do not occur on the same trials. The picture that emerges is that either a first spike is followed by a second spike 2 ms later, which is not caused by a peak in input current, or an isolated spike is produced by a current peak. The simplest interpretation is that a current peak produces either a single spike or a pair of spikes
separated by the minimum interspike interval (for β = 0.98, the reverse correlation curve shows only a faint indication of triplets, which becomes much stronger for β = 0.995 [not shown]). Two effects favor clustered firing: (1) the small reset enables long-lasting current peaks to cause repeated firing, and (2) a decreasing average potential favors a bimodal firing pattern. For instance, in the case β = 0.98, the reset potential is initially 0.3 mV below threshold, but then the potential decreases toward the saturation potential, which is 0.8 mV below threshold. Thus, it is just after a spike that the firing conditions are the most favorable; however, the more the firing is delayed after that initial period, the higher the current peaks required and the longer the average time to the next peak.3 While the first effect generally causes clear bursts of spikes, the second causes a more discrete clustered firing. Interspike interval (ISI) distribution curves show a transition from regular, to random, to clustered, and eventually to bursting firing (Bugmann, 1995) when the value of β increases. This transition can be described as a progressive replacement of single spikes with regular interspike intervals by clusters with a larger and larger number of spikes per cluster, a process that enhances the ISIs at small intervals and depletes them at large intervals. Such a process increases the value of CV (Bugmann, 1995). Clusters can also be seen as multiple responses to favorable input events, which increase the gain of the neuron, as illustrated in Figure 5.4 The higher gain observed with high values of β can also be inferred from the small change in average potential that is required to cause a large change in firing rate (see, e.g., Figure 4). Thus, simulations suggest a correlation between high gain and highly irregular firing. 8 Summary The effects of the strength of reset on the irregularity of firing can be summarized as follows: 1. With total reset, for a spike to be produced after 10 ms, the average potential must rise steeply during the interspike interval. This favors the late production of spikes with relatively regular interspike intervals. When the potential crosses the threshold, fluctuations play a minor role in determining the exact time of firing. This explains a small nonzero CV for the model with total reset.5
3 Firing is possible only after the refractory time (2 ms in this case), so the average potential at that time is relevant for fluctuation detection. 4 This is consistent with the observation that in bursting neurons, information is carried by the rate of bursts rather than the rate of spikes (Bair, Koch, Newsome, & Britten, 1994). 5 When the average potential approaches the threshold very slowly, fluctuations become the main cause of firing. Therefore, neuron models with total reset also show high values of CV at low frequency.
Figure 5: Transfer function of the LIF neuron model with various values of the reset parameter β. With β = 0 our model is equivalent to the one used in Softky and Koch (1993). The theoretical current threshold is 1.5, but due to the long integration time steps used (1 ms) the model underestimates leaks and shows a smaller threshold of 1.41. Firing below that threshold is due to fluctuations. Note that partial reset modifies the gain and not the current threshold. Simulation conditions as in Figure 1.
2. With partial reset and a reset parameter β = 0.91, the average potential remains relatively constant and close to the threshold, so that random firing results from the detection of current peaks. The amplitude of reset can be seen as a way of selecting the amplitude of the fluctuations that are able to cause firing. This also determines the average interspike interval. 3. With partial reset and a reset parameter β = 0.98, the potential is very close to the threshold after firing and repetitive firing occurs during periods of maintained high input current. Because each detected current peak can produce multiple spikes, to achieve an average interval of 10 ms, the number of detected current peaks must be small. For that purpose, a lower average potential is used, which causes long, silent periods between bursts. This bimodal firing pattern results in values of CV > 1. For a given average interspike interval, with strong reset, the duration of the interspike interval is determined by the time taken to integrate the input current. With weak reset, the random occurrence of current peaks determines the firing time. Currently, reverse correlation graphs do not
allow quantifying the contribution of each of these mechanisms in intermediate cases. While reverse correlation graphs show that current peaks of given sizes usually precede spikes, they do not indicate whether the neuron was integrating current before the peak or whether it was waiting for a peak of sufficient amplitude. Further analytical work is needed to determine whether it is possible to extract this information. Probably a reverse correlation graph of the membrane potential would be more informative. Alternatively, useful indications can be obtained by comparing the firing times caused by a fluctuating current and by an equivalent continuous current, as done earlier in this article. In conclusion, highly irregular firing similar to that observed in biological neurons can be produced with a simple leaky integrator model provided with a partial reset mechanism. This mechanism is a simple way to improve the realism of spike trains produced by simulated neurons. It is also a powerful parameter to control the gain of a neuron. Note that none of the mechanisms suggested by Softky and Koch (1993) is at work here: strong inhibitory leak, individual input spikes with high amplitude or strong dendritic amplification, or nonrandom synchronized inputs. By using partial reset, the temporal integration of random input spikes is exploited to maintain the average potential of the neuron at a small distance from the threshold during the entire integration time, allowing input current fluctuations (made of small EPSPs) to cause firing at random times. Appendix A: Equivalence Between a Model with Partial Reset and a Model with Time-Dependent Threshold The demonstration has two steps. First, it is shown that the potential of the membrane in the case of partial reset is the sum of a decaying reset potential Vo(t) and a potential Vs(t) due to the integration of input spikes in a neuron with total reset. Let us assume that the input current I(t) is made of very short impulses of duration Δt, each carrying a charge V1 ∗ C. It is then straightforward to show that the membrane potential V(t), which is a solution of the leaky integrator equation A.1,

dV(t)/dt = −(1/RC) V(t) + (1/C) I(t),    (A.1)
has the following form:

V(t) = βVth exp(−t/RC) + V1 Σj exp(−(t − tj)/RC) = Vo(t) + Vs(t),    (A.2)
where tj is the time of arrival of an individual spike. When the reset is total, that is, β = 0, then V(t) = Vs(t), which shows that Vs(t) has the desired property (i.e., is the potential of a neuron with total reset). The second step is to show that the firing time is the same for the model with partial reset and constant threshold Vth and the model with total reset but varying threshold Vth(t). The neuron with partial reset fires an interval T after the last spike if the firing condition is satisfied:

V(T) = Vth = Vo(T) + Vs(T).    (A.3)
This is equivalent to the firing condition describing the firing time in a model with total reset and a time-dependent threshold Vth(t):

Vth(T) = Vth − Vo(T) = Vs(T).    (A.4)
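The equivalence can also be verified numerically: driven by the same input, a partial-reset neuron with fixed threshold Vth and a total-reset neuron with threshold Vth(t) = Vth(1 − β exp(−t/RC)) fire at exactly the same steps. A minimal sketch (input parameters are illustrative; both neurons start as if a spike has just been fired):

    import numpy as np

    rng = np.random.default_rng(3)

    dt, tau, vth, beta = 1e-4, 10e-3, 15.0, 0.91
    decay = np.exp(-dt / tau)
    drive = 0.16 * rng.binomial(50, 200.0 * dt, 200_000)  # shared EPSPs (mV)

    v1, v2, since = beta * vth, 0.0, 0      # as if a spike was just fired
    sp1, sp2 = [], []
    for t in range(len(drive)):
        since += 1
        v1 = v1 * decay + drive[t]          # partial reset, constant threshold
        v2 = v2 * decay + drive[t]          # total reset, rising threshold
        vth_t = vth - beta * vth * decay ** since   # Vth(t) = Vth - Vo(t)
        if v1 >= vth:
            sp1.append(t)
            v1 = beta * vth
        if v2 >= vth_t:
            sp2.append(t)
            v2, since = 0.0, 0
    print("identical spike times:", sp1 == sp2)   # prints True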
This shows that, with the same input conditions, both models fire at the same time and are therefore equivalent. In the model of Wilbur and Rinzel (1983), the threshold increases with a function of shape tanh(t), while partial reset results in an exponential increase (see equations A.2 and A.4). Appendix B: Decay Time Constant for Fluctuations Let us assume that the input current is made of a continuous background Io to which a fluctuating component with average zero is added. The continuous component brings the potential of the neuron up to a saturation value Vo. A positive current fluctuation has brought the potential to Vo + ΔV. At time t = 0, the current becomes equal to Io, and the potential decays. At time zero, the following equation describes the decay of the potential from a value Vo + ΔV:

dV/dt = −(Vo + ΔV)/RC + Io/C.    (B.1)
As Vo = Io ∗ R, equation B.1 simplifies to

dV/dt = −ΔV/RC.    (B.2)
Equation B.2 describes the discharge of a capacitor with time constant RC and no external input current. If the continuous input current Io were also switched off at time zero, the capacitor would discharge at the larger rate (Vo + ΔV)/RC. Acknowledgments We are thankful to Christian Lehmann, Sue McCabe, Mike Denham, and Jack Boitano for helpful discussions and comments on earlier versions of
this article. We are also very grateful to anonymous referees for their stimulating criticisms and pointers to the literature. One of us, C.C., gratefully acknowledges the support of the Leverhulme Trust for Grant F.40Z awarded to King’s College London.
References

Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. Journal of Neuroscience, 14, 2870–2892.
Bell, A. J., Mainen, Z. F., Tsodyks, M., & Sejnowski, T. J. (1995). Balancing conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego, CA: Institute for Neural Computation, University of California at San Diego.
Bras, H., Gogan, P., & Tyc-Dumont, S. (1988). The mammalian central neuron is a complex computing device. Proc. IEEE 1st Intl. Conf. on Neural Networks, San Diego, CA, 4, 123–126.
Bugmann, G. (1990). Irregularity of natural spike trains simulated by an integrate-and-fire neuron. In Extended Abstracts, 3rd International Symposium on Bioelectronics and Molecular Electronic Devices (pp. 105–106). Toranomon, Japan: FED.
Bugmann, G. (1991a). Summation and multiplication: Two distinct operation domains of leaky integrate-and-fire neurons. Network: Computation in Neural Systems, 2, 489–509.
Bugmann, G. (1991b). Neural information carried by one spike. Proc. 2nd Australian Conf. on Neural Networks (ACNN’91), Sydney, pp. 235–238.
Bugmann, G. (1995). Controlling the irregularity of spike trains (Research Rep. NRG-95-04). Plymouth, UK: School of Computing, University of Plymouth.
Bugmann, G., & Taylor, J. G. (1994). A top-down model for neuronal synchronisation (Research Rep. NRG-94-02). Plymouth, UK: School of Computing, University of Plymouth.
Carpenter, G. A. (1979). Bursting phenomena in excitable membranes. SIAM J. Appl. Math., 36, 334–372.
Christodoulou, C., Clarkson, T., Bugmann, G., & Taylor, J. G. (1994). Modelling of the high firing variability of real cortical neurons with the temporal noisy-leaky integrator neuron model. Proc. IEEE Int. Conf. on Neural Networks (ICNN’94), World Congress on Computational Intelligence (WCCI’94), Orlando, FL, pp. 2239–2244.
Lansky, P., & Musila, M. (1991). Variable initial depolarisation in Stein’s neuronal model with synaptic reversal potentials. Biological Cybernetics, 64, 285–291.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Rospars, J. P., & Lansky, P. (1993). Stochastic neuron model without resetting of dendritic potential: Application to the olfactory system. Biological Cybernetics, 69, 283–294.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in noise? Current Opinion in Neurobiology, 5, 248–250.
Shigematsu, Y., Akiyama, S., & Matsumoto, G. (1992). A spike-firing neural cell (SAM). Extended Abstracts, 4th Int. Symp. on Bioelectronic and Molecular Electronic Devices, Miyazaki, pp. 13–14.
Softky, W. R. (1995). Simple code versus efficient code. Current Opinion in Neurobiology, 5, 239–247.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4, 643–646.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neuroscience, 13, 334–350.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local-field potentials. Neural Computation, 6, 795–836.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. Journal of Theoretical Biology, 105, 345–368.
Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. J. Neuroscience, 13, 3406–3420.
Received July 14, 1995; accepted July 30, 1996.
Communicated by Anthony Zador
Shunting Inhibition Does Not Have a Divisive Effect on Firing Rates∗ Gary R. Holt Christof Koch Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Shunting inhibition, a conductance increase with a reversal potential close to the resting potential of the cell, has been shown to have a divisive effect on subthreshold excitatory postsynaptic potential amplitudes. It has therefore been assumed to have the same divisive effect on firing rates. We show that shunting inhibition actually has a subtractive effect on the firing rate in most circumstances. Averaged over several interspike intervals, the spiking mechanism effectively clamps the somatic membrane potential to a value significantly above the resting potential, so that the current through the shunting conductance is approximately independent of the firing rate. This leads to a subtractive rather than a divisive effect. In addition, at distal synapses, shunting inhibition will also have an approximately subtractive effect if the excitatory conductance is not small compared to the inhibitory conductance. Therefore regulating a cell’s passive membrane conductance—for instance, via massive feedback—is not an adequate mechanism for normalizing or scaling its output. 1 Introduction Many neuronal models treat the output of a neuron as an analog value coded by the firing rate of a neuron. Often the analog value is thought of as what the somatic voltage would be if spikes were pharmacologically disabled (sometimes called a generator potential). This has led to a class of simplified single-compartment models where the steady-state above-threshold membrane potential is computed as

V = Isyn / G,
∗ It has come to our attention that a previous study also pointed out that shunting synapses close to the spike mechanism have a subtractive effect (F. Gabbiani, J. Midtgaard, and T. Knöpfel, Synaptic Integration in a Model of Cerebellar Granule Cells, J. Neurophysiol., 72: 999–1009, 1994).
Neural Computation 9, 1001–1013 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: A comparison of divisive and subtractive inhibition. (A) Divisive inhibition changes the slope of the input-output relationship. In this case, f = g(V) was a linear function of V and G was varied from 10 to 70 nS in equal steps. (B) Subtractive inhibition shifts the curves by subtracting a current. Here Iinh varies from 0.08 to 0.058 nA in equal steps.
where Isyn is the synaptic current and G is the input conductance. A firing rate is then computed directly from this above-threshold membrane potential: f = g(V), where g is some monotonic function—for example, fout ∝ V² in Carandini and Heeger (1994) or fout = tanh(V) in Hopfield (1984). Varying G—for instance, via activation of inhibitory input with a reversal potential close or equal to the cell’s resting potential (also known as silent or shunting inhibition)—will directly affect the generator potential V in a divisive manner. A recent and quite popular model (Carandini & Heeger, 1994; Nelson, 1994) has suggested that changing G by shunting inhibition would be a useful way to control the gain of a cell; when the inhibitory input rate increases, the slope of the input-output relationship decreases (see Figure 1A), but the threshold does not change much. On the other hand, inhibition that is not of the shunting variety should have a subtractive effect on the input-output relationship. If the reversal potential of the inhibition is far from the spiking threshold, then the inhibitory synapse will act more like a current source; the cell’s conductance is not changed much, but a hyperpolarizing current is injected. This current simply shifts the input-output relationship by changing Isyn to Isyn − Iinh (see Figure 1B) where Iinh is the inhibitory current. Simplified models based on a generator potential ignore the effect of the spiking mechanism and assume that the behavior of the neuron above threshold is adequately described by the subthreshold equations. We show that because of the spiking mechanism, changing the membrane leak con-
ductance by shunting inhibition does not have a divisive effect on firing rate, casting doubt on the hypothesis that such a mechanism serves to normalize a cell’s response. A similar conclusion has been reached independently for certain classes of neuron models (Payne & Nelson, 1996). 2 Methods Compartmental simulations were done using the model described by Bernander, Douglas, Martin, and Koch (1991); Bernander, Koch, and Douglas (1994); Bernander (1993); and Koch, Bernander, and Douglas (1995). The geometry for the compartmental models was derived from a large layer V pyramidal cell and a much smaller layer IV spiny stellate cell stained during in vivo experiments in cat (Douglas, Martin, & Whitteridge, 1991) and reconstructed. Geometries of both cells are shown as insets in Figure 2. Each model has the same eight active conductances at the soma, including an A current and adaptation currents (see Koch et al., 1995, for details). The somatic conductance values were different for each cell, but the same conductance per unit area was used for each type of channel. Dendrites were passive. Simulations were performed with the program Neuron (Hines, 1989, 1993). To study the effect of GABAA synapses and shunting inhibition, we did not explicitly model each synapse; we set the membrane leak conductance and reversal potential at each location in the dendritic tree to be the time-averaged values expected from excitatory and inhibitory synaptic bombardment at presynaptic input firing rates fE and fI (as described in Bernander et al., 1991). To compute the time-averaged values, synapses were treated as alpha functions with a given time constant and maximum conductance (g(t) = gmax (t/τ) exp(1 − t/τ)). The area density of synapses at a given location on a dendrite was a function of the length of dendrite that separated the area from the soma (see Table 1). Two different sets of densities (“near” and “far”) were used, depending on whether inhibitory synapses were near the soma or far from the soma. The “near” configuration is identical to the distribution used by Bernander et al. (1991) and reflects the anatomical observation that inhibitory synapses are mostly located near the soma in cortical pyramidal cells. The “far” configuration is not intended to be anatomically realistic. For simplicity, we used the same number of synapses for both the spiny stellate cell and the layer V pyramidal cell models. Our integrate-and-fire model is described by

C dV/dt = −V gleak + Isyn    for V < Vth,    (2.1)
where gleak is the input conductance, C is the capacitance, and Isyn is the synaptic input. When the voltage V exceeds a threshold Vth , the cell emits a spike and resets its voltage back to 0. We used C = 1 nF, gleak = 16 nS, Vth = 16.4 mV, which matches the adapted current-discharge curve of the
Table 1: Parameters of Synapses in the Compartmental Models

Type     Reversal Potential   Number   gmax     τ        Area Density, Near           Area Density, Far
AMPA     0 mV                 4000     1 nS     0.5 ms   1 + tanh((x − 40)/22.73)     1 + tanh((x − 40)/22.73)
GABAA    −70 mV               500      1 nS     5 ms     e^(−x/50)                    1 + tanh((x − 100)/22.73)
GABAB    −95 mV               500      0.1 nS   40 ms    x e^(−x/50)                  1 + tanh((x − 100)/22.73)
Note: x is the length of dendrite in µm that separates this compartment from the soma. The “near” area density was used for Figure 2; the “far” area density was used for Figure 4. The normalization for the area density is not included in the expression because it depends on the geometry; different neurons have different fractions of their membrane at a given distance from the soma.
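For a synapse activated at rate f, the time-averaged conductance follows directly from the alpha function: ⟨g⟩ = f ∫ g(t) dt = f gmax e τ. A sketch of the resulting pooled conductance and effective reversal potential, using the parameter pairings as reconstructed in Table 1 above and illustrative input rates (the full model additionally weights synapses by the area densities and adds the resting membrane leak):

    import numpy as np

    def mean_g_ns(f_hz, gmax_ns, tau_s):
        # <g> = f * integral of gmax*(t/tau)*exp(1 - t/tau) dt = f * gmax * e * tau
        return f_hz * gmax_ns * tau_s * np.e

    f_e, f_i = 2.0, 4.0                      # illustrative presynaptic rates (Hz)
    g_ampa  = 4000 * mean_g_ns(f_e, 1.0, 0.5e-3)
    g_gabaa =  500 * mean_g_ns(f_i, 1.0, 5e-3)
    g_gabab =  500 * mean_g_ns(f_i, 0.1, 40e-3)
    g_syn = g_ampa + g_gabaa + g_gabab
    e_rev = (g_ampa * 0.0 - g_gabaa * 70.0 - g_gabab * 95.0) / g_syn
    print(f"synaptic conductance = {g_syn:.1f} nS, "
          f"effective reversal = {e_rev:.1f} mV")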
layer V pyramidal cell model quite well (not shown). Results from this article are not changed if a refractory period or an adaptation conductance is added to the integrate-and-fire model. 3 Results We first describe our results and reasoning for an “anatomically correct” distribution of synaptic inhibition in relatively close proximity to the cell body. Since we fail to observe silent inhibition acting divisively but would like to know under which circumstances it can do so, we next investigate the anatomically less realistic situation of more distal inhibition. 3.1 Proximal Inhibition. Changing gleak does not change the slope of the current-discharge curve for integrate-and-fire cells (see Figure 2A); it primarily shifts the curves. It therefore has a subtractive rather than a divisive effect. The compartmental models’ behavior is very similar to that of the integrate-and-fire unit. For two different geometries (a layer V pyramid and a layer IV spiny stellate cell), we computed the fully adapted firing rate as a function of the excitatory synaptic input rate for various rates of inhibitory input to synapses with GABAA receptors (see Figure 2B). The slope of the input-output relationship does not change when the GABAA input amplitude is changed; the entire curve shifts. The same effect can be observed when considering the current-discharge curves (not shown). This effect is most easily understood in the integrate-and-fire model. In the absence of any spiking threshold, V would rise until V = Isyn/gleak (see Figure 3). Under these conditions, the steady-state leak current is propor-
Figure 2: Changing gleak has a subtractive rather than a divisive effect on firing rates. (A) Current-discharge curves for the integrate-and-fire model, with gleak varying from 10 to 70 nS (from left to right) in steps of 10 nS. (B) Fully adapted firing rates of the two cells as a function of excitatory input rate for different inhibitory input rates. From left to right, the curves correspond to a GABAA inhibitory rate of 0.5, 2, 4, and 6 Hz. In all of these cases, the curve shifts rather than changes slope. In this case, inhibitory synapses were near the soma (“near” configuration in Table 1), as found in cortical cells.
However, if there is a spiking threshold, V never rises above V_th. Therefore, no matter how large the input current is, the leak current can never be larger than V_th g_leak. We can replace the leak conductance by a current whose value is equal to the time-averaged value of the current through the leak conductance (⟨I_leak⟩ = g_leak ⟨V⟩), and simplify the leaky integrate-and-fire unit to a perfect integrator. (Now, however, ⟨I_leak⟩ will be a function of I_syn.) If the current is suprathreshold, the cell will still fire at exactly the same rate because the same charge ∫_0^T (I_syn − I_leak) dt is deposited on the capacitor during one interspike interval T, although for the leaky integrator the deposition rate is not constant. For constant, just-suprathreshold inputs, ⟨V⟩ will be close to V_th, and ⟨I_leak⟩ will be large. For larger synaptic input currents, the time-averaged membrane potential becomes less and less (since V has to charge up from the reset point), and therefore the time-averaged leak current decreases for increasing inputs (compare Figures 3A and 3B). It can be shown that
⟨I_leak⟩ = I_syn                                                              if I_syn < V_th g_leak,
⟨I_leak⟩ = I_syn ( 1 + V_th g_leak / [ I_syn log(1 − V_th g_leak/I_syn) ] )   otherwise.   (3.1)
For large I_syn, and even for quite moderate levels of I_syn just above V_th g_leak, the lower expression is approximately equal to g_leak V_th/2, independent of I_syn (see Figure 3C). Therefore, it is a good approximation to replace the leak conductance by a constant offset current. The current-discharge curve for
the resulting perfect integrate-and-fire neuron is simply¹

f(I_syn) = (I_syn − ⟨I_leak⟩) / (C V_th) = I_syn/(C V_th) − g_leak/(2C).   (3.2)
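The subtractive shift predicted here is easy to verify numerically. The following sketch is our illustration, not the authors' simulation code: it integrates equation 2.1 by forward Euler using the parameters quoted in Section 2 (C = 1 nF, V_th = 16.4 mV) and two illustrative values of g_leak, and compares the simulated rates with the perfect-integrator approximation of equation 3.2.

```python
import numpy as np

C = 1e-9        # capacitance (F), from Section 2
V_th = 16.4e-3  # spike threshold (V), from Section 2

def lif_rate(I_syn, g_leak, T=2.0, dt=1e-5):
    """Forward-Euler simulation of equation 2.1; returns the rate in Hz."""
    V, spikes = 0.0, 0
    for _ in range(int(T / dt)):
        V += dt * (-V * g_leak + I_syn) / C
        if V >= V_th:      # threshold crossing: count a spike, reset to 0
            spikes += 1
            V = 0.0
    return spikes / T

for g_leak in (16e-9, 32e-9):          # two illustrative leak conductances
    for I_syn in (0.8e-9, 1.2e-9, 1.6e-9):
        f_sim = lif_rate(I_syn, g_leak)
        f_eq = I_syn / (C * V_th) - g_leak / (2 * C)   # equation 3.2
        print(f"g_leak = {g_leak * 1e9:.0f} nS, I = {I_syn * 1e9:.1f} nA: "
              f"simulated {f_sim:.1f} Hz, equation 3.2 gives {f_eq:.1f} Hz")
# Doubling g_leak lowers every rate by about delta_g/(2C) = 8 Hz: a shift
# of the current-discharge curve, not a change in its slope.
```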
Shunting inhibition (varying g_leak) above threshold acts as a constant, hyperpolarizing current source, quite distinct from its subthreshold behavior.

A similar mechanism explains the result for the compartmental models. Here, the voltage rises above the firing threshold, but the spiking mechanism acts as a kind of voltage clamp on a long time scale (Koch et al., 1995) so that the time-averaged voltage remains approximately constant. Furthermore, spiking conductances are so large that during a spike, any proximal synaptic conductances will be ignored.

3.2 Distal Inhibition. Distal synapses are not so tightly coupled electrically to the soma, so one might expect that distal GABAA inhibition might act divisively. Since the spiking mechanism can be thought of as a kind of voltage clamp (Koch et al., 1995), one can study the neuron's response by examining the current into the soma when it is clamped at the time-averaged voltage V_s (Abbott, 1991). For analysis, we simplify the dendritic tree into a single finite cable that has an excitatory synapse (conductance g_E, reversal potential E_E) and an inhibitory synapse (g_I and E_I) located at the other end. The cable has length l, radius r, specific membrane conductance G_leak, intracellular resistivity R_i, and a length constant λ = sqrt(r/(2 R_i G_leak)). Using the cable equation, one can show that at steady state, the current flowing into the soma from the cable is

I_soma = −g_∞ V_s + 2 g_∞ e^(−L) [ g_E (E_E − V_s e^(−L)) + g_I (E_I − V_s e^(−L)) + g_∞ V_s e^(−L) ] / [ (g_E + g_I)(1 − e^(−2L)) + g_∞ (1 + e^(−2L)) ],   (3.3)

where L = l/λ is the electrotonic length of the cable and g_∞ = π r^(3/2) sqrt(2 G_leak/R_i) is the input conductance of a cylinder of the same radius but with infinite length. (In this equation, all voltages are relative to the leak reversal potential, not to ground.) Despite the simplification involved in equation 3.3, it qualitatively describes the response of the compartmental model. First, in the absence of any cable (L = 0), this equation becomes linear in both g_E and g_I, and inhibition acts to subtract a constant amount from I_soma, as we have shown above.

¹ This expression can also be derived from the Laurent expansion of the current-discharge curve for a leaky integrator, f(I_syn) = −g_leak/[C log(1 − V_th g_leak/I_syn)], in terms of 1/I_syn around I_syn = ∞ (Stein, 1967).
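Equation 3.3 can be evaluated directly to see how much of the inhibitory effect is subtractive. The sketch below is our illustration; the values L = 1, g_inf = 1, V_s = 20 mV, E_E = 70 mV, and E_I = 0 (all relative to the leak reversal, loosely following the −70 mV/−50 mV figures discussed later) are arbitrary illustrative choices, not the paper's fitted parameters.

```python
import numpy as np

def I_soma(gE, gI, L=1.0, g_inf=1.0, Vs=20.0, EE=70.0, EI=0.0):
    """Steady-state somatic current from equation 3.3. Voltages are in mV
    relative to the leak reversal potential, so EI = 0 places the GABA_A
    reversal at the leak reversal (about -70 mV to ground) and Vs = 20
    corresponds to a soma clamped near -50 mV. Conductances are in units
    of g_inf."""
    eL, e2L = np.exp(-L), np.exp(-2.0 * L)
    num = gE * (EE - Vs * eL) + gI * (EI - Vs * eL) + g_inf * Vs * eL
    den = (gE + gI) * (1.0 - e2L) + g_inf * (1.0 + e2L)
    return -g_inf * Vs + 2.0 * g_inf * eL * num / den

for gE in (0.5, 1.0, 2.0, 4.0):
    shift = I_soma(gE, 0.5) - I_soma(gE, 1.5)
    print(f"gE = {gE:.1f}: I_soma = {I_soma(gE, 0.5):6.2f}, "
          f"stronger inhibition removes {shift:5.2f}")
# The amount removed by the stronger inhibition varies only weakly with
# gE, while I_soma itself changes sign and grows: mostly a subtraction.
```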
Figure 3: Why shunting inhibition has a subtractive rather than a divisive effect on an integrate-and-fire unit. (A) The time-dependent current across the leak conductance, I_leak(t) = V(t) g_leak (in nA), in response to a constant 0.5 nA current into a leaky integrate-and-fire unit with (solid line) and without (dashed line) a voltage threshold, V_th. The sharp drops in I_leak occur when the cell fires, since the voltage is reset. (B) Same for a 1 nA current. Note that I_leak with a voltage threshold has a maximum value well below I_leak without a voltage threshold. (C) Time-averaged leak current ⟨I_leak⟩ in nA as a function of input current, computed from the analytic expression. Below threshold, the spikeless model and the integrate-and-fire model have the same I_leak, but above threshold ⟨I_leak⟩ drops for the integrate-and-fire model because of the voltage threshold. For I_syn just greater than threshold, the cell spends most of its time with V ≈ V_th, so ⟨I_leak⟩ is high (panel A; at threshold, I_leak = V_th g_leak). For high I_syn, the voltage increases approximately linearly with time, so V has a sawtooth waveform as shown in panel B. This means that ⟨I_leak⟩ = (max I_leak)/2 = V_th g_leak/2.
When L ≠ 0, some divisive effect is expected since g_I appears in the denominator. However, a subtractive effect will persist due to the term containing g_I in the numerator. The reversal potential of GABAA synapses (increasing a membrane conductance to chloride ions) is in the neighborhood of −70 mV relative to ground, while the time-averaged voltage when the model neuron is spiking is around −50 mV (Koch et al., 1995). When a cell is spiking, therefore, a nonzero driving potential exists for GABAA inhibition. In the pyramidal cell model, the subtractive effect turns out to be much more prominent than the divisive effect even for quite distant inhibition (see Figure 4A).
Figure 4: The effect of more distant inhibition on firing rates for the layer V pyramidal cell. (A) The effect on the input-output relationship when inhibitory synapses are more distant from the soma ("far" configuration in Table 1). Inhibitory rates are 0.5 Hz (solid) and 8 Hz (dashed). In these simulations, E_I, the GABAA reversal potential, had its usual value of −70 mV. Top: The voltage clamp current when the soma is clamped to −50 mV, close to the spiking threshold of the cell. Bottom: The adapted firing rate when the soma is not clamped. Very little divisive effect is visible on the firing rate; there is a slope change for firing rates less than 20 Hz, but this is too small to have a significant effect. (B) Same as A, except that E_I and the reversal potential of all leak conductances were changed to −50 mV so there is no driving force behind the GABAA synapses or the membrane passive conductance. A clear change in slope for low firing rates is evident. However, even for this rather unphysiological parameter manipulation, subtraction prevails at moderate and high input rates.
Both the inhibitory and excitatory synapses have been moved to more than 100 µm (which is more than 1 λ) away from the soma. To a good approximation, inhibition still subtracts a constant from both the current delivered to the soma and the firing rate of the cell. Equation 3.3 predicts that if the term containing g_I is removed from the numerator, then a divisive effect might be visible. When we changed the reversal potential E_I of the GABAA synapses and the leak reversal potential E_leak to −50 mV, we found that there is indeed an observable change in slope at a low firing rate (see Figure 4B). For higher firing rates, however, inhibition still acts approximately subtractively. We demonstrate this in the extreme case of moving both E_I and the reversal potential associated with the leak conductance to −50 mV (also the value to which the somatic terminal of the cable is clamped).
Figure 5: When gE is not small compared to gI , then inhibition acts more subtractively than divisively even when the IPSP reversal potential is equal to the somatic voltage. The two upper curves are Isoma as a function of gE for two different values of gI such that gI + B changes by a factor of two in equation 3.4. The lower curve is the difference between those two curves. For gE > gI + B, it is approximately constant over a large range. Units of gE are such that gI + B = 1 for the smaller value of gI .
Under these conditions, equation 3.3 simplifies to

I_soma(g_E, g_I) = constant + A g_E/(g_E + g_I + B),   (3.4)
where A and B are independent of g_E and g_I. Clearly if g_E is small, changing g_I simply changes the slope. When g_E is not small, then it turns out that changing g_I has an effect that is more subtractive than divisive (see Figure 5). Since in the "far" model both kinds of synapses have the same distribution, the synaptic conductances are proportional to the input firing rates. With the synaptic parameters we have used (see Table 1), g_E/g_I = 0.8 f_E/f_I; therefore, we expect to see a subtractive effect when f_E > 6 Hz, and this is approximately true (see Figure 4B).

4 Discussion

Divisive normalization of firing rates has become a popular idea in the visual cortex (Albrecht & Geisler, 1991; Heeger, 1992; Heeger, Simoncelli, &
Movshon, 1996). It has been suggested that this is accomplished through shunting inhibitory synapses activated by cortical feedback (Carandini & Heeger, 1994; for a similar suggestion in the electric fish, see Nelson, 1994). Most discussions of shunting inhibition have assumed that the voltage at the location of the shunt is not constrained and may rise as high as necessary (e.g., Blomfield, 1974; Torre & Poggio, 1978; Koch, Poggio, & Torre, 1982, 1983). However, when the shunt is located close to the soma, the voltage at the site of the shunt cannot rise above the spike threshold. Therefore, the current that can flow into the cell through the shunting synapse is limited and at moderate rates becomes a constant offset (see Figure 3C). Under these circumstances, shunting inhibition implements a linear subtractive operation.

Even if the conductance change is not located close to the soma, it may not have a divisive effect (see Figure 4A). First, when the cell is spiking, shunting inhibition is not "silent"; there is a significant driving force behind GABAA inhibition, since the somatic voltage is clamped to approximately −50 mV by the spiking mechanism, and the reversal potential for GABAA inhibition is in the neighborhood of −70 mV. Second, even if the reversal potential for GABAA and the leak reversal potential are set to −50 mV, inhibition acts divisively only if the excitatory synaptic conductance is small compared to the inhibitory conductance (see Figures 4B and 5). Large excitatory conductances are expected when the cell receives significant input (Bernander et al., 1991; Rapp, Yarom, & Segev, 1992), so the subtractive effect of large conductances is relevant physiologically.

Current-discharge curves are affected as predicted by the simple integrate-and-fire model in response to inhibitory postsynaptic potentials and GABA iontophoresis in motoneurons in vivo (Granit, Kernell, & Lamarre, 1966; Kernell, 1969) and cortical cells in vitro (Connors, Malenka, & Silva, 1988; Berman, Douglas, & Martin, 1992). The input-output curves for different amounts of inhibition do not diverge for larger inputs, as would be required for a divisive effect; in fact, they converge at high rates because of the refractory period (Douglas & Martin, 1990). In recordings from Limulus eccentric cells, current-discharge curves show both a slope change and a shift (Fuortes, 1959) because the site of current injection is distant from the site of spike generation.² Rose (1977) showed that iontophoresis of GABAA onto an in vivo cortical network appeared to act divisively. Because shunting inhibition has a subtractive effect on single cells, this could possibly be caused by a network effect (Douglas, Koch, Mahowald, Martin, & Suarez, 1995).

For synapses close to the spike-generating mechanism, as well as for the integrate-and-fire unit, the subtractive effect of conductance changes does
² In this case, only the current-discharge curve was measured; the considerations in Figures 4 and 5 are not relevant.
not depend on the reversal potential of the conductance. Changing the reversal potential is equivalent to adding a constant current source in parallel with the conductance, and in a single-compartment model this will obviously merely shift the current-discharge curve. Therefore, like inhibitory input, proximal excitatory input does not change the gain of other superimposed excitatory input.

Our analysis assumes that synaptic inputs change on a time scale slower than an interspike interval. High temporal frequencies may be present in synaptic input currents for irregularly spiking neurons. Furthermore, our analysis assumes passive dendrites; active dendritic conductances complicate the interaction of synaptic excitation and inhibition. Although we cannot rule out that under some parameter combinations shunting inhibition could act divisively on the firing rates, we have not found such a range for physiological conditions. In combination with our integrate-and-fire and single cable models, we believe that a different mechanism is necessary to account for divisive normalization.

Acknowledgments and Notes

We thank the referees for helpful critical comments on this article. This research was supported by the NIMH and the Sloan Center for Theoretical Neuroscience. The compartmental models and associated programs are available from: ftp://ftp.klab.caltech.edu/pub/holt/holt and koch 1997 normalization.tar.gz.

References

Abbott, L. F. (1991). Realistic synaptic inputs for model neural networks. Network 2:245–258.

Albrecht, D. G., & Geisler, W. S. (1991). Motion selectivity and the contrast-response function of simple cells in the visual cortex. Vis. Neurosci. 7:531–546.

Berman, N. J., Douglas, R. J., & Martin, K. A. C. (1992). GABA-mediated inhibition in the neural networks of visual cortex. Prog. Brain Res. 90:443–476.

Bernander, Ö. (1993). Synaptic integration and its control in neocortical pyramidal cells. Unpublished doctoral dissertation, California Institute of Technology, Pasadena.

Bernander, Ö., Douglas, R., Martin, K., & Koch, C. (1991). Synaptic background activity determines spatio-temporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA 88:1569–1573.

Bernander, Ö., Koch, C., & Douglas, R. J. (1994). Amplification and linearization of distal synaptic input to cortical pyramidal cells. J. Neurophysiol. 72:2743–2753.

Blomfield, S. (1974). Arithmetical operations performed by nerve cells. Brain Res. 69:114–124.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science 264:1333–1335.

Connors, B. W., Malenka, R. C., & Silva, L. R. (1988). Two inhibitory postsynaptic potentials, and GABAA and GABAB receptor-mediated responses in neocortex of rat and cat. J. Physiol. 406:443–468.

Douglas, R. J., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science 269:981–985.

Douglas, R. J., & Martin, K. A. C. (1990). Control of neuronal output by inhibition at the axon initial segment. Neural Comp. 2:283–292.

Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). An intracellular analysis of the visual responses of neurones in cat visual cortex. J. Physiol. 440:659–696.

Fuortes, M. G. F. (1959). Initiation of impulses in visual cells of Limulus. J. Physiol. 148:14–28.

Granit, R., Kernell, D., & Lamarre, Y. (1966). Algebraical summation in synaptic activation of motoneurons firing within the "primary range" to injected currents. J. Physiol. 187:379–399.

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Vis. Neurosci. 9:181–197.

Heeger, D. J., Simoncelli, E. P., & Movshon, J. A. (1996). Computational models of cortical visual processing. Proc. Natl. Acad. Sci. USA 93:623–627.

Hines, M. (1989). A program for simulation of nerve equations with branching geometries. Int. J. Biomed. Comp. 24:55–68.

Hines, M. (1993). The NEURON simulation program. In J. Skrzypek (Ed.), Neural Network Simulation Environments. Boston: Kluwer.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81:3088–3092.

Kernell, D. (1969). Synaptic conductance changes and the repetitive impulse discharge of spinal motoneurons. Brain Res. 15:291–294.

Koch, C., Bernander, Ö., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci. 2:63–82.

Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Proc. R. Soc. Lond. B 298:227–264.

Koch, C., Poggio, T., & Torre, V. (1983). Nonlinear interaction in a dendritic tree: Localization, timing and role in information processing. Proc. Natl. Acad. Sci. USA 80:2799–2802.

Nelson, M. E. (1994). A mechanism for neuronal gain control by descending pathways. Neural Comp. 6:242–254.

Payne, J. R., & Nelson, M. E. (1996). Neuronal gain control by regulation of membrane conductance in two classes of spiking neuron models. Fifth Annual Computational Neuroscience Meeting Abstracts, p. 147.

Rapp, M., Yarom, Y., & Segev, I. (1992). The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Comp. 4:518–533.

Rose, D. (1977). On the arithmetical operation performed by inhibitory synapses onto the neuronal soma. Exp. Brain Res. 28:221–223.
Stein, R. B. (1967). The frequency of nerve action potentials generated by applied currents. Proc. R. Soc. Lond. B 167:64–86.

Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. Lond. B 202:409–416.
Received August 19, 1996; accepted October 31, 1996.
Communicated by John Rinzel
Reduction of the Hodgkin-Huxley Equations to a Single-Variable Threshold Model

Werner M. Kistler, Wulfram Gerstner, J. Leo van Hemmen
Institut für Theoretische Physik, Physik-Department der TU München, D-85747 Garching bei München, Germany

It is generally believed that a neuron is a threshold element that fires when some variable u reaches a threshold. Here we pursue the question of whether this picture can be justified and study the four-dimensional neuron model of Hodgkin and Huxley as a concrete example. The model is approximated by a response kernel expansion in terms of a single variable, the membrane voltage. The first-order term is linear in the input and its kernel has the typical form of an elementary postsynaptic potential. Higher-order kernels take care of nonlinear interactions between input spikes. In contrast to the standard Volterra expansion, the kernels depend on the firing time of the most recent output spike. In particular, a zero-order kernel that describes the shape of the spike and the typical afterpotential is included. Our model neuron fires if the membrane voltage, given by the truncated response kernel expansion, crosses a threshold. The threshold model is tested on a spike train generated by the Hodgkin-Huxley model with a stochastic input current. We find that the threshold model predicts 90 percent of the spikes correctly. Our results show that, to good approximation, the description of a neuron as a threshold element can indeed be justified.

1 Introduction
Neuronal activity is the result of a highly nonlinear dynamic process that was first described mathematically by Hodgkin and Huxley (1952), who proposed a set of four coupled differential equations. The mathematical analysis of these and other coupled nonlinear equations is known to be a hard problem, and an intuitive understanding of the dynamics is difficult to obtain. Hence, a simplified description of neuronal activity is highly desirable and has been attempted repeatedly, for example, by FitzHugh (1961); Rinzel (1985); Abbott and Kepler (1990); Av-Ron, Parnas, and Segel (1991); Kepler, Abbott, and Marder (1992); and Joeken and Schwegler (1995), to mention only a few. An example of a simple model of neuronal spike dynamics is the integrate-and-fire neuron, which dates back to Lapicque (1907)

Neural Computation 9, 1015–1045 (1997)
© 1997 Massachusetts Institute of Technology
and has become quite popular in neural networks modeling (see, e.g., Usher, Schuster, and Niebur, 1993; Abbott & van Vreeswijk, 1993; Tsodyks, Mitkov, & Sompolinsky, 1993; Hopfield & Herz, 1995). In this article we aim at a generalized, and realistic, version of the integrate-and-fire model.

In order to reduce the four nonlinear equations of Hodgkin and Huxley to a simplified model with a single, scalar variable u, we use the spike response method, which has been introduced previously (Gerstner, 1991, 1995; Gerstner & van Hemmen, 1992). A spike is triggered at a time t^f if the membrane potential u approaches a threshold ϑ from below. For t > t^f, with t^f being the most recent firing time, the membrane voltage u in the reduced description is given by an expansion of the form,
u(t) = η(t − t^f) + ∫_0^∞ ds ε^(1)(t − t^f; s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε^(2)(t − t^f; s, s′) I(t − s) I(t − s′) + · · ·   (1.1)
The following spike is generated if u(t) = ϑ and (d/dt) u(t) > 0, which defines the next firing time t^f, and so on.

The kernels η and ε have a direct neuronal interpretation. The first term η(t − t^f) on the right-hand side of equation 1.1 describes the form of a spike and the typical afterpotential following it. The action potential is triggered at t = t^f, it reaches its maximum after a time δ, and it is followed by a period of hyperpolarization that extends over approximately 15 ms (see Figure 1a). From a different point of view, we can think of η(t − t^f) as the neuron's response to the threshold crossing at t^f. Similarly, the kernel ε describes the response of the membrane potential to an input current I. More precisely, ε^(1)(∞; s) as a function of s is the linear response to a small current pulse at time s = 0 given that the neuron did not fire in the past. In biological terms, ε^(1) describes the form of the postsynaptic potential evoked by an input spike (see Figure 1b). Since the responsiveness of the membrane to input pulses is reduced during or shortly after an action potential, the form of the postsynaptic potential depends also on the time t − t^f that has passed since the last output spike. Higher-order terms in equation 1.1 describe the nonlinear interaction between several input pulses.

We concentrate here on the four-dimensional set of equations proposed by Hodgkin and Huxley to describe the spike activity in the giant axon of the squid (Hodgkin & Huxley, 1952). We consider it as a well-studied standard model of spike activity, even though it does not provide a correct description of spiking in cortical neurons of mammals. The methods introduced below are, however, more general and can also be applied to more detailed neuron models, which often involve tens or hundreds of variables (Yamada, Koch, & Adams, 1989; Wilson, Bhalla, Uhley, & Bower, 1989; Traub, Wong, Miles, & Michelson, 1991; and Ekeberg et al., 1991).
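To make the threshold picture concrete, the following sketch (our illustration, not the authors' code) implements the first-order version of equation 1.1 on a discrete time grid. The kernel shapes, the threshold value, and the input protocol are all placeholder assumptions chosen merely to resemble Figure 1 qualitatively.

```python
import numpy as np

dt = 0.1                                    # time step (ms)
tk = np.arange(0, 40, dt)                   # kernel support (ms)
# Placeholder kernels (illustrative only, not the fitted Hodgkin-Huxley
# kernels): a spike/afterpotential template and a PSP-like linear response.
eta = 100.0 * np.exp(-tk / 0.5) - 10.0 * np.exp(-tk / 8.0)
eps = (tk / 2.0) * np.exp(-tk / 2.0)

def srm_first_order(I, theta=4.0):
    """Equation 1.1 truncated after the linear term:
    u(t) = eta(t - t_f) + integral ds eps(s) I(t - s)."""
    u_lin = dt * np.convolve(I, eps)[:len(I)]   # discretized convolution
    last, spikes, u_prev = -10**9, [], 0.0
    for i in range(len(I)):
        u = u_lin[i] + (eta[i - last] if 0 <= i - last < len(eta) else 0.0)
        # A spike is triggered when u crosses theta from below:
        if u >= theta and u_prev < theta and i - last > int(5.0 / dt):
            last = i
            spikes.append(i * dt)
        u_prev = u
    return spikes

# Fluctuating input: a new Gaussian value every 2 ms (piecewise constant
# here; the paper interpolates linearly between the drawn values).
rng = np.random.default_rng(0)
I = np.repeat(rng.normal(0.0, 3.0, 500), int(2 / dt))
print(f"{len(srm_first_order(I))} spikes in {len(I) * dt:.0f} ms")
```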
Figure 1: Kernels corresponding to the Hodgkin-Huxley equations and obtained by the methods described in this article. (a) The kernel η_1 describes the typical shape of a spike and the afterpotential. The spike has been triggered at time t = 0. (b) The first-order kernel ε_1^(1) describes the form of the postsynaptic potential. The solid line in (b) shows the postsynaptic potential evoked by a short pulse of unit strength that has arrived at time t = 0 while no spikes had been generated in the past. If there was an output spike Δt = 6.5 ms (dotted line) or Δt = 10.5 ms (dashed line) before the arrival of the input pulse, the response is significantly reduced. The solid line corresponds to Δt = ∞.
2 Theoretical Framework

2.1 Standard Volterra Expansion. In order to approximate the membrane voltage u of the Hodgkin-Huxley equations, we start with a Volterra expansion (Volterra, 1959),
u(t) = ∫_0^∞ ds ε_0^(1)(s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε_0^(2)(s, s′) I(t − s) I(t − s′) + · · ·   (2.1)
The first-order term ε_0^(1)(s) describes the membrane's linear response to an external input current I(t), the second-order term ε_0^(2)(s, s′) takes into account nonlinear interactions between inputs at two different times, and the following terms take care of higher-order nonlinearities. We would like to stress that equation 2.1 is an ansatz whose usefulness we still have to verify. It is by no means evident that the series converges reasonably quickly.

2.2 Expansion During and After Spikes. When dealing with series expansions, we have to examine their convergence. As we will see, the expansion 2.1 converges as long as the input is so weak that no spikes are generated. If a spike is triggered at time t^f by some strong input, the neuronal dynamics jumps onto a trajectory in phase space that is always (nearly) the same: the action potential. During the spike, the evolution is determined by nonlinear intrinsic processes and is hardly influenced by external input. We therefore replace equation 2.1 by
u(t) = η_1(t − t^f) + ∫_0^∞ ds ε_1^(1)(t − t^f; s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε_1^(2)(t − t^f; s, s′) I(t − s) I(t − s′) + · · ·   (2.2)
Here η_1(t − t^f) describes the evolution of the membrane potential during the spike and the relaxation process thereafter. In biological terms, η_1(t − t^f) gives the form of a spike and the afterpotential. As in equation 2.1, the term ε_1^(1) describes the membrane's linear response to an input I(t). Since inputs are much less effective during and shortly after a spike, the term ε_1^(1) depends on the time t − t^f that has passed since the spike was triggered
at t^f. The ε's are called response kernels. The formal arguments justifying equation 2.2 are discussed in the appendix. The lower index of ε_1^(1) indicates that we take into account the effects of the most recent spike only. We can systematically increase the accuracy of our description by including more and more spikes that have occurred in the recent past. Taking into account the most recent p spikes, we would have:
u(t) = η_p(t − t_1^f, …, t − t_p^f) + ∫_0^∞ ds ε_p^(1)(t − t_1^f, …, t − t_p^f; s) I(t − s)
       + (1/2!) ∫_0^∞ ∫_0^∞ ds ds′ ε_p^(2)(t − t_1^f, …, t − t_p^f; s, s′) I(t − s) I(t − s′) + · · ·   (2.3)
In general, η_p(σ_1, …, σ_p) is a complicated function of the p arguments. Since we expect that adaptation is the dominant effect, we make the ansatz

η_p(σ_1, …, σ_p) = Σ_{k=1}^{p} η_1(σ_k).   (2.4)
Here η_1(σ) describes a single spike and its afterpotential. A linear summation of the afterpotentials of several preceding spikes is often sufficient to describe the well-known adaptation effects of neuronal activity (Gerstner & van Hemmen, 1992; Ekeberg et al., 1991; Kernell & Sjöholm, 1973). We will show that for the Hodgkin-Huxley model, excellent results are achieved with equation 2.2, that is, with p = 1. In other words, only the most recent output spike affects the dynamics. This is not too surprising since the Hodgkin-Huxley equations show no pronounced adaptation effects.

2.3 Relation to Integrate-and-Fire Models. In passing we note that the integrate-and-fire model is a special case of equation 2.3 with the response kernels (Gerstner, 1995)
η_1(σ) = u_0 e^(−σ/τ) Θ(σ),   ε_1^(1)(σ; s) = (1/C) e^(−s/τ) Θ(s) Θ(σ − s),   (2.5)

where u_0 is the voltage to which the system is reset after a spike, τ is the membrane time constant, C is the membrane capacitance, and Θ(·) is the Heaviside step function with Θ(s) = 1 for s ≥ 0 and 0 otherwise. Since the standard integrate-and-fire model is linear except at firing, the determination of the kernels is simple. In particular, all terms beyond ε_1^(1) vanish.
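As a quick numerical consistency check (our sketch, not from the paper), one can integrate the standard one-step integrate-and-fire equation C du/dt = −C u/τ + I(t), with reset to u_0, and compare it against the kernel expression above; all parameter values below are arbitrary illustrative choices.

```python
import numpy as np

C, tau, u0 = 1.0, 10.0, -3.0        # capacitance, time constant (ms), reset
dt, T = 0.01, 50.0                  # step and duration (ms)
t = np.arange(0.0, T, dt)
rng = np.random.default_rng(1)
I = np.repeat(rng.normal(0.0, 0.3, int(T / 2)), int(2 / dt))  # input current

# Direct Euler integration of C du/dt = -C u / tau + I(t), starting from a
# reset at t = 0 (no further spikes, so sigma = t throughout):
u_direct = np.empty_like(t)
u = u0
for i in range(len(t)):
    u_direct[i] = u
    u += dt * (-u / tau + I[i] / C)

# Kernel form: u(t) = eta_1(t) + int_0^t (1/C) e^(-s/tau) I(t - s) ds
eta1 = u0 * np.exp(-t / tau)
eps = (1.0 / C) * np.exp(-t / tau)
u_kernel = eta1 + dt * np.array(
    [np.dot(eps[: i + 1], I[i::-1]) for i in range(len(t))])

print("max |direct - kernel| =", np.max(np.abs(u_direct - u_kernel)))
# The two agree up to O(dt) discretization error, consistent with the
# kernels given above.
```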
Equation 2.5 applies to the "one-step" integrate-and-fire model without synaptic or dendritic integration, but generalizations are straightforward. We mention three types of generalization:

1. The synaptic input current is not a δ-pulse but is described by a broad pulse α(s), where s = 0 is the time of presynaptic spike arrival. In this case, the response kernel ε_1^(1) has a finite rise time proportional to the width of the input pulse α.

2. Dendritic integration can be incorporated by a broad input pulse α as well. The effect is the same as in the first generalization.

3. We can include a time-dependent threshold ϑ_0 + ϑ(σ), where σ is the time that has passed since the last output spike.

Points 1 and 2 are not relevant for a comparison with the Hodgkin-Huxley model, since the equations of Hodgkin and Huxley do not describe synaptic or dendritic transmission processes either. With respect to point 3, it is easy to see that a time-dependent threshold is equivalent to changing the form of the η_1 kernel. The firing condition is
η_1(t − t^f) + ∫_0^∞ ds ε_1^(1)(t − t^f; s) I(t − s) + · · · = ϑ_0 + ϑ(t − t^f).   (2.6)
Subtracting ϑ(σ) on both sides of the equation leads back to equation 1.1 with η(σ) = η_1(σ) − ϑ(σ). In section 4.3, we test the performance of integrate-and-fire models with and without time-dependent threshold on a scenario with a fluctuating input current and compare the results with the Hodgkin-Huxley model.

2.4 Derivation of the Kernels. For complicated neuron models described by several nonlinear differential equations, the kernels can be derived systematically by the following procedure. In order to compute the response kernel ε_0^(1), we linearize the equations around the constant solution u(t) = ū with zero input I(t) = 0. The kernel ε_0^(1) is the voltage component of the Green's function of the linearized equation. Second- and higher-order kernels can be obtained analytically by more involved methods. The mathematical details are in the Appendix.

We have explicitly calculated the kernels ε_0^(1), ε_0^(2), and ε_0^(3) for the Hodgkin-Huxley equations using a computer algebra system. Albeit the length of the expressions grows rather fast (the third-order kernel has about 400 simple terms), we found it more convenient, faster, and less consuming of computer memory to calculate the kernels analytically than using a numerical procedure.

It is instructive, however, to see how the kernels can be obtained numerically. To get the linear kernel, we solve the Hodgkin-Huxley equations
numerically for an input current I(t) = c i_{t0}(t), where i_{t0}(t) is a short current pulse of unit strength at t_0, for example, i_{t0}(t) = r^(−1) Θ(t − t_0 + r/2) Θ(r/2 − t + t_0) with r ≪ 1 ms. This is an approximate δ-function. We denote the voltage component of the solution by u[c i_{t0}](t) and set ε_0^(1)(t) = (u[c i_{t0}](t + t_0) − ū)/c in the limit c → 0 and r → 0, where ū is the steady-state solution for zero input current. In a similar vein, we measure the voltage response to two current pulses so as to find ε_0^(2).

The improved kernel ε_1^(1) can only be found numerically. This is done in a similar way as above. We use an input current made up of two current pulses: I(t) = c_1 i_{t1}(t) + c_2 i_{t2}(t). The first pulse at t_1 is strong enough to generate an output spike at t^f. The membrane potential after the spike serves as a reference trajectory u[c_1 i_{t1}]. Now we add the second pulse at time t_2 and consider the difference,
δu = lim_{c_2 → 0, r → 0} (1/c_2) { u[c_1 i_{t1} + c_2 i_{t2}] − u[c_1 i_{t1}] }.   (2.7)
The difference δu(t) gives the values of the linear response kernel ε_1^(1)(σ; s) on a "diagonal line" with (σ; s) = (s + t_2 − t^f; s),

ε_1^(1)(s + t_2 − t^f; s) = δu(t_2 + s).   (2.8)
The procedure is repeated for variable offsets t_2 − t^f by changing the interval between the first and the second pulse. Since the effect of the spike at t^f decreases for t ≫ t^f, we have

lim_{t_2 − t^f → ∞} ε_1^(1)(s + t_2 − t^f; s) = ε_0^(1)(s),   (2.9)

where ε_0^(1)(s) is the analytically derived kernel introduced above.

Finally, the kernel η_1 is determined by solving the Hodgkin-Huxley equations numerically with input I(t) = c i_{t0}(t). The amplitude c has been chosen sufficiently large so as to evoke a spike. The exact value of c is not important. We set η_1(t − t^f) = lim_{r→0} u[c i_{t0}](t) − ū for t > t_0, where t^f is the time when u[c i_{t0}](t) crosses a formal threshold ϑ and ū is the constant solution with zero input. Once the amplitude c is fixed, the kernel η_1 is uniquely determined as the form of an action potential with, apart from the triggering pulse, zero input current. The only free parameter is the threshold ϑ, which is to be found by an optimization procedure described below.
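The pulse-response procedure can be phrased in a few lines of code. The sketch below is our illustration: a toy one-variable model stands in for the Hodgkin-Huxley integrator, and only the extraction logic (apply a small pulse, divide by its amplitude c, shift the time origin to the pulse arrival) mirrors the procedure described above.

```python
import numpy as np

dt = 0.01                       # integration step (ms)

def simulate(I, T=60.0):
    """Toy stand-in for the Hodgkin-Huxley integrator: a passive membrane
    with a weak quadratic nonlinearity (purely illustrative)."""
    n = int(T / dt)
    v = np.zeros(n)
    for i in range(1, n):
        dv = -v[i - 1] / 5.0 + 0.01 * v[i - 1]**2 + I[i - 1]
        v[i] = v[i - 1] + dt * dv
    return v

def pulse(t0, r=0.05, T=60.0):
    """Approximate delta function: a box of width r and unit area at t0."""
    t = np.arange(0.0, T, dt)
    return np.where(np.abs(t - t0) < r / 2, 1.0 / r, 0.0)

# Linear kernel eps_0^(1): response to a small pulse, divided by its
# amplitude c (the c -> 0 limit is approximated by choosing c small).
c, t0 = 1e-3, 5.0
v = simulate(c * pulse(t0))
eps0 = v / c                          # ubar = 0 for this toy model
eps0 = eps0[int(t0 / dt):]            # shift so that s = 0 is pulse arrival
print("eps_0^(1) at s = 1, 5, 10 ms:",
      eps0[int(1 / dt)], eps0[int(5 / dt)], eps0[int(10 / dt)])
# The improved kernel eps_1^(1) follows the same pattern: add a strong
# first pulse that evokes a spike and divide the *difference* of the two
# responses by the second pulse's amplitude (equation 2.7).
```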
2.5 Comparison with Wiener Expansions. The approach indicated so far shows that all kernels can be found by a systematic and straightforward procedure. The kernels can be derived either analytically, as described in the appendix, or numerically by studying the system's response to input pulses on a small set of examples. This is in contrast to the Wiener theory (Wiener, 1958; Palm & Poggio, 1977), which analyzes the response of a system to gaussian white noise. Since Wiener's approach to the description of nonlinear systems is a stochastic one, the determination of the Wiener kernels requires large sets of input and output data (Palm & Poggio, 1978; Korenberg, 1989). Exploiting the deterministic response of the system to well-designed inputs (short pulses) thus simplifies things significantly. The study of impulse response functions is a well-known approach to linear or weakly nonlinear systems. Here we have extended this approach to highly nonlinear systems under the proviso that the response kernels ε are (almost everywhere) continuous functions of their arguments.

It is important to keep in mind that Volterra and Wiener series cannot fully reproduce the threshold behavior of spike generation, even if higher-order terms are taken into account. The reason is that these expansions can only approximate mappings that are smooth, whereas the mapping from input current I(t) to the time course of the membrane voltage has an apparent discontinuity at the spiking threshold (see Figure 2). Of course, this is not a discontinuity in the mathematical sense, but the output is very sensitive to small changes in the input current (Cole, Guttman, & Bezanilla, 1970; Rinzel & Ermentrout, 1989). We have to correct for this by adding the kernel η each time the series expansion indicates that a spike will occur, say, by crossing an appropriate threshold value. In doing so, we no longer expand around the constant solution u(t) = ū, but around u(t) = η(t − t^f), the time course of a standard action potential. The general framework outlined in this section will now be applied to the Hodgkin-Huxley equations.

3 Application to Hodgkin-Huxley Equations
Applying the theoretical analysis to the Hodgkin-Huxley equations, we begin by specifying the equations and then explain how the reduction to the spike response model is performed.
3.1 Hodgkin-Huxley Spike Trains. According to Hodgkin and Huxley (Hodgkin & Huxley, 1952; Cronin, 1987), the voltage v(t) across a small patch of axonal membrane (e.g., close to the hillock) changes at a rate given by

C dv/dt = −g_Na m^3 h (v − v_Na) − g_K n^4 (v − v_K) − g_L (v − v_L) + I(t),   (3.1)

where I(t) is a time-dependent input current. The constants v_Na, v_K, and v_L are the equilibrium potentials corresponding to sodium, potassium, and "leakage" currents, and the g's are parameters of the respective ion conductances.
Figure 2: Threshold behavior of the Hodgkin-Huxley equations (see equations 3.1 and 3.2). We show the response of the Hodgkin-Huxley equations to a current pulse of 1 ms duration. A current amplitude of 7.0 µA cm⁻² suffices to evoke an action potential (solid line; the maximum at 100 mV is out of scale), whereas a slightly smaller pulse of 6.9 µA cm⁻² fails to produce a spike (dashed line). The time course of the membrane voltage is completely different in both cases. Therefore, the mapping of the input current onto the time course of the membrane voltage is highly sensitive to changes in the input around the spiking threshold. The bar in the upper left indicates the duration of the input pulse.
The variables m, n, and h change according to a differential equation of the form

dx/dt = α_x(v)(1 − x) − β_x(v) x   (3.2)
with x ∈ {m, n, h}. The parameters are given in Table 1.

For a biological neuron that is part of a larger network of neurons, the input current is due to spike input from many presynaptic neurons. Hence the input current is not constant but fluctuates. In our simulations, we therefore use an input current generated by the following procedure. Every 2 ms, a random number is drawn from a gaussian distribution with zero mean and standard deviation σ. The discrete current values at intervals of 2 ms are linearly interpolated and define the input I(t). This approach is intended to mimic the effect of spike input into the neuron. The procedure is somewhat arbitrary but easily implementable and leads to a spike train with a broad interval distribution and realistic firing rates depending on σ, as shown in Figure 3.
Table 1: Parameters of the Hodgkin-Huxley Equations (Membrane Capacity, C = 1 µF/cm²)

x    v_x        g_x
Na   115 mV     120 mS/cm²
K    −12 mV     36 mS/cm²
L    10.6 mV    0.3 mS/cm²

x    α_x(v)                                   β_x(v)
n    (0.1 − 0.01v)/[exp(1 − 0.1v) − 1]        0.125 exp(−v/80)
m    (2.5 − 0.1v)/[exp(2.5 − 0.1v) − 1]       4 exp(−v/18)
h    0.07 exp(−v/20)                          1/[exp(3 − 0.1v) + 1]
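The following sketch (ours, not the authors' code) integrates equations 3.1 and 3.2 with the rate functions of Table 1 and the fluctuating input protocol just described, using plain forward Euler steps; the spike count it reports depends on σ and on the random seed, and the removable singularities of α_n and α_m at v = 10 and v = 25 are not handled specially.

```python
import numpy as np

# Hodgkin-Huxley parameters from Table 1 (voltages in mV, relative to rest).
C = 1.0                                   # membrane capacity (uF/cm^2)
g = {'Na': 120.0, 'K': 36.0, 'L': 0.3}    # peak conductances (mS/cm^2)
E = {'Na': 115.0, 'K': -12.0, 'L': 10.6}  # equilibrium potentials (mV)

alpha = {'n': lambda v: (0.1 - 0.01*v) / (np.exp(1 - 0.1*v) - 1),
         'm': lambda v: (2.5 - 0.1*v) / (np.exp(2.5 - 0.1*v) - 1),
         'h': lambda v: 0.07 * np.exp(-v / 20)}
beta  = {'n': lambda v: 0.125 * np.exp(-v / 80),
         'm': lambda v: 4 * np.exp(-v / 18),
         'h': lambda v: 1 / (np.exp(3 - 0.1*v) + 1)}

def fluctuating_current(T, sigma, dt, seed=0):
    """Gaussian values drawn every 2 ms, linearly interpolated."""
    rng = np.random.default_rng(seed)
    knots = rng.normal(0.0, sigma, int(T / 2) + 2)
    t = np.arange(0.0, T, dt)
    return np.interp(t, 2.0 * np.arange(len(knots)), knots)

def run_hh(T=1000.0, sigma=3.0, dt=0.01):
    I = fluctuating_current(T, sigma, dt)
    v, x = 0.0, {'n': 0.32, 'm': 0.05, 'h': 0.60}  # approximate rest values
    spikes, above = 0, False
    for i in range(int(T / dt)):
        m3h, n4 = x['m']**3 * x['h'], x['n']**4
        dv = (-g['Na']*m3h*(v - E['Na']) - g['K']*n4*(v - E['K'])
              - g['L']*(v - E['L']) + I[i]) / C
        for k in x:                               # gating kinetics, eq. 3.2
            x[k] += dt * (alpha[k](v)*(1 - x[k]) - beta[k](v)*x[k])
        v += dt * dv
        if v > 50.0 and not above:                # count upward crossings
            spikes += 1
        above = v > 50.0
    return spikes

print("spikes in 1 s:", run_hh())
```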
We emphasize that our single-variable model is intended to reproduce the firing times of the spikes generated by the Hodgkin-Huxley equations. This is a harder problem than fitting the gain function for constant input current.

3.2 Approximation by a Threshold Model. We want to reduce the four Hodgkin-Huxley equations (3.1 and 3.2) to the spike response model (see equation 2.2) in such a way that the model generates a spike train identical with or at least similar to the original one of Figure 3a. The reduction will involve four major steps:

1. Calculate the response functions ε_0^(1), ε_0^(2), and ε_0^(3).

2. Derive the kernel η_1.

3. Analyze the corrections to the ε kernels that are caused by the most recent output spike.

4. Determine a threshold criterion for firing.

Before starting, we note that for I(t) = Ī, the Hodgkin-Huxley equations allow a stationary solution with constant values v(t) = v̄, m(t) = m̄, h(t) = h̄, and n(t) = n̄. The constant solution is stable if Ī is smaller than a threshold value of 9.47 µA/cm², which is determined by the parameter values given in Table 1.
3.2.1 Standard Volterra Expansion. The kernels ε_0^(1)(s_1) and ε_0^(2)(s_1, s_2) have been derived as indicated in the appendix and are shown in Figures 1b (solid line) and 4. Figure 5 shows a small section of the spike train of Figure 3a with a single spike. Two intervals, one before (lower left) and a second during and immediately after the spike (lower right), are shown on an enlarged scale. Before the spike occurs, the numerical solution of the Hodgkin-Huxley equations (solid) is well approximated by the
Figure 3: Hodgkin-Huxley model. (a) 1000 ms of a simulation of the Hodgkin-Huxley equations stimulated through a fluctuating input current I(t) with zero mean and standard deviation 3 µA/cm². Because of the fluctuating input, the spikes occur at stochastic intervals with a broad distribution. (b) A histogram of the interspike interval (ISI) sampled over a period of 100 s.
first-order Volterra expansion u(t) = ∫ds ε_0^(1)(s) I(t − s) (long dashed line) and even better by the second-order approximation (dotted line). During a spike, however, even an approximation to third order fails completely (see Figure 6).

The discrepancy during a spike may be surprising at first sight but is not completely unexpected. During action potential generation, the neuronal dynamics is mainly driven by internal processes of channel opening and closing, much more than by external input. Mathematically, the reason is that the series expansion is valid only as long as I(t) is not too large; see the Appendix.
Figure 4: Second-order kernel for the Hodgkin-Huxley equations. The figure exhibits a contour plot for the second-order kernel ε_0^(2)(s, s′). The kernel vanishes identically for negative arguments, as is required by causality.
3.2.2 Expansion Including Action Potentials. In order to account for the remaining differences, we have to add explicitly the shape of the spike and the afterpotential by way of the kernel η_1(s). The function η_1(s) has been determined by the procedure explained in section 2 and is shown in Figure 1a. If we add η_1(t − t^f) but continue to work with the standard kernels ε_0^(1)(s), ε_0^(2)(s, s′), …, we find large errors in an interval up to roughly 20 ms after a spike. This is exemplified in the lower right of Figure 5, where the long-dashed line shows the approximation u(t) = η_1(t − t^f) + ∫ds ε_0^(1)(s) I(t − s). From a biological point of view, this is easy to understand. The kernel ε_0^(1)(s) describes a postsynaptic potential in response to a standard input spike. Due to refractoriness, the responsiveness of the membrane is lower after a spike, and for this reason the postsynaptic potential looks different if an input pulse arrives shortly after an output spike. We then have to work with the improved kernels ε_1^(1)(σ; s), ε_1^(2)(σ; s, s′), … introduced in equation 2.2. The dotted line in the lower-right graph of Figure 5 gives the approximation u(t) := η_1(t − t^f) + ∫ds ε_1^(1)(t − t^f; s) I(t − s), and we see that the fit is nearly perfect.
3.2.3 Threshold Criterion. The trigger time t^f of an action potential is given by a simple threshold crossing process. Although the expansion in terms of the ε kernels will never give the perfect form of the spike,
Figure 5: Systematic approximation of the Hodgkin-Huxley model by the spike response model (see equation 2.2). The upper graph shows a small section of the Hodgkin-Huxley spike train of Figure 3a. The lower left diagram shows the solution of the Hodgkin-Huxley equations (solid) and the first (dashed) and second (dotted) order approximation with the kernels ε_0^(1) and ε_0^(2) before the spike occurs, on an enlarged scale. The lower right diagram illustrates the behavior of the membrane immediately after an action potential. The solution of the Hodgkin-Huxley equations (solid line) is approximated by u(t) = η_1(t − t^f) + ∫ds ε_0^(1)(s) I(t − s), dashed line. Using the improved kernel ε_1^(1) instead of ε_0^(1) results in a nearly perfect fit (dotted line).
the truncated series expansion does exhibit rather significant peaks at positions where the Hodgkin-Huxley model produces spikes (see Figure 6). We therefore decided to apply a simple voltage threshold criterion instead of using some other derived quantity such as the derivative v̇ or an effective input current (Koch, Bernander, & Douglas, 1995). Whenever the membrane potential in terms of the expansion (see equation 2.3) reaches the threshold ϑ from below, we define a firing time t^f and add a contribution η_1(t − t^f). The threshold parameter ϑ is considered as a fit parameter and has to be optimized as described below.
Figure 6: Approximation of the Hodgkin-Huxley model by a standard Volterra expansion (see equation 2.1) during a spike. Here we demonstrate the importance of the kernel η_1 as it appears in equation 2.2. First- (long dashed) and second-order approximation (dashed) using the kernels ε_0^(1) and ε_0^(2) produces a significant peak in the membrane potential where the Hodgkin-Huxley equations generate an action potential (solid line), but even the third-order approximation (dotted) fails to reproduce the spike with a peak of nearly 100 mV (far out of scale). The remaining difference is taken care of by the kernel η_1.
4 Simulation Results
We have compared the full Hodgkin-Huxley model and the spike response model (see equation 2.2) in a simulation of spike activity over 100,000 ms (i.e., 100 s). In so doing, we have accepted a spike of the spike response model to be coincident with the corresponding spike of the Hodgkin-Huxley model if it arrives within a temporal precision of ±2 ms.

4.1 Coincidence Measure. As a measure of the rate of coincidence of spikes generated by the Hodgkin-Huxley equations and another model such as equation 2.3, we define an index Γ that is the number of coincidences minus the number of coincidences by chance, relative to the total number of spikes produced by the Hodgkin-Huxley and the other model. More precisely, Γ is the fraction

Γ = (1/N) (N_coinc − ⟨N_coinc⟩) / [ (N_HH + N_SRM)/2 ].   (4.1)
Here N_coinc is the number of coincident spikes of Hodgkin-Huxley and the other model as counted in a single simulation run, N_HH is the number of action potentials generated by the Hodgkin-Huxley equations, and N_SRM is the
number of spikes produced by the other model, say, equation 2.2. The normalization factor N restricts Γ to unity in the case of a perfect coincidence of the other model's spike train with the Hodgkin-Huxley one (N_coinc = N_SRM = N_HH). Finally, ⟨N_coinc⟩ is the average number of coincidences obtained if the spikes of our model were generated not systematically but randomly by a homogeneous Poisson process. Hence Γ ≈ 0 if equation 2.2 produced spikes randomly. The definition 4.1 is a modification of an idea of Joeken and Schwegler (1995).

In order to calculate ⟨N_coinc⟩, we perform a gedanken experiment. We are given the numbers N_HH and N_SRM and divide the total simulation time into K bins of 4 ms length each. Due to refractoriness, each bin contains at most one Hodgkin-Huxley spike; we denote the number of these bins by N_HH. So (K − N_HH) bins are empty. We now distribute N_SRM randomly generated spikes among the K bins, each bin receiving at most one spike. A coincidence occurs each time a bin contains a Hodgkin-Huxley spike and a random spike. The probability of encountering N_coinc coincidences is thus hypergeometrically distributed, that is,

p(N_coinc) = C(N_HH, N_coinc) C(K − N_HH, N_SRM − N_coinc) / C(K, N_SRM),

where C(n, k) denotes the binomial coefficient, with mean ⟨N_coinc⟩ = N_SRM N_HH / K. To see why, we use a simple analogy. Imagine an urn with N_HH black balls and (K − N_HH) white balls, and perform a random sampling of N_SRM balls without replacement. The number of black balls drawn from the urn corresponds to the number of coincidences in the original problem. This setup is governed by the hypergeometric distribution (Prohorov & Rozanov, 1969; Feller, 1970). In passing, we note that dropping the refractory side condition leads to a binomial distribution and, hence, to the same result for ⟨N_coinc⟩.

Using the correct normalization, N = 1 − N_HH/K, we thus obtain a suitable measure: Γ has expectation zero for a purely random spike train, yields unity in case of a perfect coincidence of the spike response model's spike train with the Hodgkin-Huxley one, and is bounded by −1 from below. A negative Γ hints at a negative correlation. Furthermore, Γ is linear in the number of coincidences and monotonically decreasing with the number of erroneous spikes. That is, Γ decreases with increasing N_SRM while N_coinc is kept fixed, or with decreasing N_coinc while N_SRM is kept fixed.

While simulating the spike response model, we have found that due to the long-lasting afterpotential, a single misplaced spike causes subsequent spikes to occur with high probability at false times too. Since this is not a numerical artifact but a well-known and even experimentally observed biological effect (Mainen & Sejnowski, 1995), we have adopted the following procedure in order to eliminate this type of problem from the statistics. Every time the spike response model fails to reproduce a spike of the Hodgkin-Huxley spike train, we note an error and add the kernel η at the position where the original Hodgkin-Huxley spike occurred. Analogously, we count an error and omit the afterpotential if the spike response model produces a spike where no spike should occur. In case of a coincidence between the Hodgkin-Huxley spike and a spike of the spike response model, we take no action; we place the η kernel at the position suggested by the threshold criterion. Without correction, Γ would decrease by at most a few percentage points, and the standard deviation would increase by a factor of two.
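A direct implementation of this coincidence measure is straightforward. The following sketch is our illustration; spike trains are passed as sorted arrays of firing times in ms, and the window-based matching (each Hodgkin-Huxley spike may be claimed at most once within ±2 ms) is our reading of the convention above.

```python
import numpy as np

def gamma_measure(t_hh, t_srm, T, precision=2.0):
    """Coincidence index Gamma of equation 4.1 for two spike trains.
    T is the total simulation time (ms); bins are 2 * precision = 4 ms."""
    K = T / (2.0 * precision)
    n_hh, n_srm = len(t_hh), len(t_srm)
    used = np.zeros(n_hh, dtype=bool)
    n_coinc = 0
    for t in t_srm:
        j = np.searchsorted(t_hh, t)
        for k in (j - 1, j):                      # nearest HH spikes
            if 0 <= k < n_hh and not used[k] and abs(t_hh[k] - t) <= precision:
                used[k] = True
                n_coinc += 1
                break
    expected = n_srm * n_hh / K                   # <N_coinc>, Poisson chance
    norm = 1.0 - n_hh / K                         # normalization factor N
    return (n_coinc - expected) / (0.5 * (n_hh + n_srm) * norm)

# Toy check: identical trains give Gamma = 1; unrelated trains give ~0.
rng = np.random.default_rng(2)
t_hh = np.sort(rng.uniform(0, 100_000, 3000))
print(gamma_measure(t_hh, t_hh, 100_000.0))
print(gamma_measure(t_hh, np.sort(rng.uniform(0, 100_000, 3000)), 100_000.0))
```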
4.2 Low Firing Rate. The results of the simulations for low mean firing rates are summarized in Table 2. Due to the large interspike intervals, effects from spikes in the past are rather weak. Hence we can ignore the dependence of the ε kernels on the former firing times and work with ε_0^(1)(s) instead of ε_1^(1)(σ; s). We do, however, take into account the kernel η_1(s). Higher-order kernels ε_0^(2), ε_0^(3) give a significant improvement over an approximation with ε_0^(1) only. This reflects the importance of the nonlinear interaction of input pulses. In a third-order approximation, the spike response model reproduces 85 percent of the Hodgkin-Huxley spikes correctly (see Table 2).

4.3 High Firing Rate. At higher firing rates, the influence of the most recent output spike is more important than the nonlinear interaction between input pulses. We therefore use the kernel ε_1^(1) but neglect higher-order terms with ε_1^(2), ε_1^(3), … in equation 2.2. Figure 7 gives the results of the simulations with different mean firing rates. As for Table 2, the threshold ϑ and the time δ between the formal threshold crossing time t^f and the maximum of the action potential have been optimized in a separate run over 10 s. For this optimization run we have used an input current with σ = 3 µA/cm², corresponding to a mean firing rate of about 33 Hz; the maximum mean firing rate is in the range of 70 Hz. For firing rates above 30 Hz, the single-variable model reproduces about 90 percent of the Hodgkin-Huxley spikes correctly. This is quite remarkable since we have used only the first-order approximation,
u(t) = η_1(t − t^f) + ∫_0^∞ ds ε_1^(1)(t − t^f; s) I(t − s),
and neglected all higher-order terms. If we neglected the influence of the last spike, we would end up with a coincidence rate of only 73 percent, even in a second-order approximation using kernels ε_0^(1) and ε_0^(2).

Closer examination of the simulation results for various mean firing rates reveals a systematic deviation of the mean firing rate of the spike response model from the mean firing rate of the Hodgkin-Huxley equations. More precisely, the spike response model generates additional spikes where no spike should occur, and it misses fewer spikes if the standard deviation of the input current is increased, which is quite natural (Mainen & Sejnowski, 1995).

We have performed the same test with two versions of traditional integrate-and-fire models:
Table 2: Simulation Results of the Spike Response Model Using the Volterra Kernels ε_0^(1), …, ε_0^(3)

Order    ϑ (mV)   δ (ms)   N_SRM     N_coinc   N_wrong   N_missed   Γ
First    4.00     2.85     91 ± 8    71 ± 7    20 ± 5    16 ± 3     0.788 ± 0.030
Second   4.70     2.75     87 ± 11   74 ± 8    13 ± 3    13 ± 2     0.845 ± 0.016
Third    5.80     2.50     88 ± 10   76 ± 8    13 ± 3    11 ± 3     0.858 ± 0.019
Note: The table gives the number of spikes produced by the spike response model (N_SRM), the number of coincidences (N_coinc), the number of spikes produced by the spike response model where no spike should occur (N_wrong), and the number of missing spikes (N_missed). The numbers give the average and standard deviation after 10 runs with different realizations of the input current and a duration of 10 s per run. The numbers should be compared with the number of spikes in the Hodgkin-Huxley model, N_HH = 87 ± 8. The coincidence rate Γ has been defined by equation 4.1. For random gambling we would obtain Γ ≈ 0. The parameters ϑ (threshold) and δ (time from threshold crossing to maximum of the action potential) have been determined by optimizing the results in a separate run over 10 s.
a simple integrate-and-fire model with reset potential u_0 and time constant τ, and an integrate-and-fire model with time-dependent threshold derived from the η_1 kernel found by the methods described in section 2.4: ϑ(σ) = −η_1(σ) for σ > 5 ms and ϑ(σ) = ∞ for σ < 5 ms. The results are summarized in Figure 8. We have optimized the parameters τ, ϑ_0, and u_0 with a fluctuating input current as already explained and found optimal performance for a time constant of τ = 1 ms. This surprisingly short time constant should be compared to the time constants of the ε_1^(1) kernel. Indeed, if the time since the last spike is shorter than about 10 ms (see Figure 1b), the ε_1^(1) kernel approaches zero within a few milliseconds. The optimized reset value was u_0 = −3 mV, and the threshold was ϑ_0 = 3.05 mV for the model with and ϑ_0 = 2.35 mV for the model without time-dependent threshold.

4.4 Response to Current Step Functions. Constant input currents have been a paradigm for neuron models ever since Hodgkin and Huxley (1952), although a constant current is very far from reality and one may wonder whether it offers a sensible criterion for the behavior of neurons producing spikes and, thus, responding to fluctuating input currents. Nevertheless, we want to study the response of our model to current steps and compare it with that of the original Hodgkin-Huxley model.

As a consequence of the oscillatory behavior of the first-order kernel ε_1^(1) (see Figure 1b), our model exhibits a phenomenon known as inhibitory rebound.
Figure 7: Simulation results for different mean firing rates using the first-order improved kernel ε_1^(1) only. The black bars indicate the number of spikes (left axis) produced by the Hodgkin-Huxley model; the neighboring bars correspond to the number of spikes of the spike response model. The gray shading reflects the coincident spikes. All numbers are averages of 10 runs over 10 s with different realizations of the input current. The mean firing rate varies between 23 and 40 Hz. The error bars are the corresponding standard deviations. The line in the upper part of the diagram indicates the coincidence rate Γ (right axis) as defined by equation 4.1. The parameters ϑ = 4.7 mV and δ = 2.15 ms have been optimized in a separate run with an input current with standard deviation σ = 3 µA cm⁻².
Suppose we have a constant inhibitory (i.e., negative) input current for time t < 0, which is turned off suddenly at t = 0. We can easily calculate the membrane potential from equation 1.1 if we assume that there are no spikes in the past. The resulting voltage trace exhibits a significant oscillation (see Figure 9). If the current step is large enough, the positive overshooting triggers a single spike after the release of the inhibition. Since equation 1.1 is linear in the input, the amplitude of the oscillatory response of the membrane potential in Figure 9 is proportional to the height of the current step but independent of the absolute current values before or after the step. This is in contrast to the Hodgkin-Huxley equations, where amplitude and frequency of the oscillations depend on both step height and absolute current values (Mauro, Conti, Dodge, & Schor, 1970).

The limitations of a voltage threshold for spike prediction can be investigated
Figure 8: Comparison of various threshold models. The diagram shows the coincidence rate Γ defined in section 4.1 for a simple integrate-and-fire model with time constant τ = 1 ms and reset potential u₀ = −3 mV (i&f), an integrate-and-fire model with time-dependent threshold (i&f*), and the spike response model (SRM). The error bars indicate the standard deviation of 10 simulations over 10 s each with a fluctuating input current with σ = 3 μA cm⁻².
by systematically studying the response of the model to current steps (see Figure 10). For t < 0 we apply a constant current with amplitude I₁. At t = 0 we switch to a stronger input I₂ > I₁. Depending on the values of the final input current I₂ and the step height ΔI = I₂ − I₁, we observe three different modes of behavior in the original Hodgkin-Huxley model, as is well known (Cronin, 1987). For small steps and low final input currents, the membrane potential exhibits a transient oscillation but no spikes (inactive phase I). Larger steps can induce a single spike immediately after the step (single-spike phase S). Increasing the final input current further leads to repetitive firing (R). Note that repetitive firing is possible for I > 6 μA cm⁻², but firing must be triggered by a sufficiently large current step ΔI. Only for currents larger than 9.47 μA cm⁻² is there autonomous firing independent of the current step. We conclude from the step response behavior that there are two different threshold paradigms: a single-spike threshold (dashed line in Figure 10) and a threshold for repetitive firing (solid line). We want to compare these results with the behavior of our threshold model. The same set of step response simulations has been performed with the spike response model. Comparing the threshold lines in the (I₂, ΔI) diagram for the Hodgkin-Huxley model and the spike response model, we can state a qualitative correspondence. The spike response model exhibits the same three modes of response as the Hodgkin-Huxley equations. Let us have a closer look at the two thresholds: the single-spike threshold and the repetitive-firing threshold. The slope of the single-spike threshold (dashed
Figure 9: Response of the spike response model to a current step. An inhibitory input current that is suddenly turned off produces a characteristic oscillation of the membrane potential. The overshooting membrane potential can trigger an action potential immediately after the inhibition is turned off. The amplitude of the oscillation is proportional to the height of the current step. Here, the current is switched from −1 μA cm⁻² to 0 at time t = 0.
line in Figure 10) is far from perfect if we compare the lower and the upper parts of the figure. The slope is determined by the mean ∫ds ε₁^(1)(s) of the linear response kernel and is therefore fixed as long as we stick to the membrane voltage as the relevant variable used in applying the threshold criterion. We now turn to the threshold of repetitive firing. The position of the vertical branch of the repetitive-firing threshold (solid line in Figure 10) is shifted to lower current values as compared to the Hodgkin-Huxley model. Consequently, the gain function of the spike response model is shifted to lower current values too (see Figure 11). The repetitive-firing threshold is directly related to the (free) threshold parameter ϑ and can be moved to larger current values by increasing ϑ. Using ϑ = 9 mV instead of ϑ = 4.7 mV results in a reasonable fit of the Hodgkin-Huxley gain function (see Figure 11). However, a shift of ϑ would also shift the single-spike threshold of Figure 10 to larger current values, and the triggering of single spikes at low currents would be made practically impossible. The value for the threshold parameter ϑ = 4.7 mV found by optimization with a fluctuating input as described in section 4.3 is therefore a compromise. It follows from Figure 10 that there is not a strict voltage threshold for spiking in the Hodgkin-Huxley model. Nevertheless, our model qualitatively exhibits all relevant phases, and, with the fluctuating input scenario, there is even a fine quantitative agreement.
Figure 10: Comparison of the response to current steps of the Hodgkin-Huxley equations (upper graph) and the spike response model (lower graph). At time t = 0 the input current is switched from I₁ to I₂, and the behavior of the models is studied in dependence on the final current I₂ and the step height ΔI = I₂ − I₁. The figures in the center indicate in the manner of a phase diagram three different regimes: (I) the neuron remains inactive; (S) a single spike is triggered; (R) repetitive firing is induced. We have chosen four representative pairs of values of I₂ and ΔI (marked by dots in the phase diagram) and plotted the corresponding time course of the membrane potential on the left and the right of the main diagram so that the responses of the Hodgkin-Huxley model and the spike response model can be compared. The parameters for the spike response model are the same as those detailed in the caption to Figure 7.
Figure 11: Comparison of the gain function (output firing rate for constant input current I as a function of I) of the spike response model and the Hodgkin-Huxley model. The threshold parameter ϑ = 4.7 mV found by the optimization procedure described in section 4.3 gives a gain function of the spike response model (dashed line) that is shifted toward lower current values as compared to the gain function of the Hodgkin-Huxley model (solid line). Increasing the threshold parameter to ϑ = 9 mV results in a reasonable fit (dotted line).
5 Discussion

In contrast to the complicated dynamics of the Hodgkin-Huxley equations, the spike response model in equation 2.2 has a straightforward and intuitively appealing interpretation. Furthermore, the transparent structure of the model has a great number of advantages. The dynamical behavior of a single neuron can be discussed in simple but biological terms of threshold, postsynaptic potential, refractoriness, and afterpotential.

As a first example, we discuss the refractory period. Refractoriness of the Hodgkin-Huxley model leads to two distinct effects. On the one hand, we have shunting effects, a reduced responsiveness of the membrane to input spikes during and immediately after the spike, that is related to absolute refractoriness and is expressed in our model by the first argument of ε₁^(1). In addition, the kernel η₁ induces an afterpotential corresponding to a relative refractory period. In the Hodgkin-Huxley model, the afterpotential corresponds to a hyperpolarization that lasts for about 15 ms, followed by a rather small depolarization. A different form of the afterpotential with a significant depolarizing phase would lead to intrinsic bursting (Gerstner & van Hemmen, 1992). So this aspect of neuronal behavior is easily taken care of by the spike response model.

As a second example, let us focus on the temperature dependence of the various kernels. Temperature can be included in the Hodgkin-Huxley equations
by multiplying the right-hand side of equation 3.2 by a temperature-dependent factor ζ (Cronin, 1987),

ζ = exp[ln(3.0) (T − 6.3)/10.0],   (5.1)
with the temperature T in degrees Celsius. Equation 3.1 remains unmodified. In Figure 12 we have plotted the resulting ε₁^(1) and η₁ kernels for different temperatures. Although the temperature correction affects only the equations for m, h, and n, the overall effect can be approximated by a time rescaling, since the form of both kernels is not modified but stretched in time with decreasing temperature.

Apart from the transparent structure, the single-variable model is open to a mathematical analysis that would be inconceivable for the full Hodgkin-Huxley equations. Most important, the collective behavior of a large network of neurons can now be predicted if the form of the kernels η and ε^(1) is known (Gerstner & van Hemmen, 1993; Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996). In particular, the existence and stability of collective oscillations can be studied by a simple graphical construction using the kernels η and ε^(1). It can be shown that a collective oscillation is stable if the sum of the postsynaptic potentials is increasing at the moment of firing; otherwise it is unstable (Gerstner et al., 1996). Furthermore, it can be shown that in a fully connected and homogeneous network of spike response neurons with a potential described by equation 1.1, the state of asynchronous firing is almost always unstable (Gerstner, 1995). In simulations of a network of Hodgkin-Huxley neurons, too, a spontaneous breakup of the asynchronous firing state and a small oscillatory component in the mean firing activity have been observed (Wang, personal communication, 1995).

In summary, we have used the Hodgkin-Huxley equations as a well-studied reference model of spike dynamics and shown that they can indeed be reduced to a threshold model. Our methods are more general and can also be applied to more elaborate models that involve many ion currents and a complicated spatial structure. Furthermore, an analytic evaluation of some of the kernels (see the appendix for mathematical details) is possible. We have also presented a numerical algorithm by which the response kernels can, in contrast to the Wiener method, be determined quickly and easily. In fact, the spike response method we have used in our analysis can be seen as a systematic approach to the reduction of a complex system to a threshold model.
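Returning to the temperature correction, the following Python sketch evaluates the factor ζ of equation 5.1 and approximates a kernel at temperature T by stretching a reference kernel tabulated at 6.3 degrees Celsius, in the spirit of Figure 12. The function names and the interpolation-based rescaling are our own illustrative assumptions, not part of the original numerical procedure.

```python
import numpy as np

def temperature_factor(T, T_ref=6.3, q10=3.0):
    """Temperature-dependent rate factor zeta of equation 5.1 (T in deg C)."""
    return np.exp(np.log(q10) * (T - T_ref) / 10.0)

def rescale_kernel(t, kernel_ref, T):
    """Approximate a kernel at temperature T by rescaling time in the
    reference kernel tabulated on the grid t at T_ref = 6.3 deg C.
    Higher temperature -> faster gating -> kernel compressed in time."""
    zeta = temperature_factor(T)
    return np.interp(zeta * t, t, kernel_ref, left=0.0, right=0.0)
```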
Appendix

In this appendix we treat Volterra's theory (Volterra, 1959) in the context of inhomogeneous differential equations, indicate under what conditions and how the kernels of the Volterra series can be computed analytically, and isolate the mathematical essentials of our own approach.
Figure 12: Temperature dependence of the kernels corresponding to the Hodgkin-Huxley equations. Both the η₁ kernel (a) and the ε₁^(1) kernel (b) show, apart from a stretching in time, no significant change in form if the temperature is decreased from 15.0 (dotted line) and 10.0 (dashed line) to 6.3 (solid line) degrees Celsius.
A.1 Volterra Series. We start by studying the ordinary differential equation

ẋ(t) = f(x(t)) + j(t),   (A.1)

where j is a given function of time and ẋ denotes differentiation of x(t) with respect to t. We require that for j = 0, x = 0 should be an attractive fixed point with a sufficiently large basin of attraction so that the system returns
to equilibrium if the perturbation j(t) is turned off. Furthermore, we require f and j to be continuous and bounded and f to fulfill a Lipschitz condition. Finally, j ∈ C²(ℝ) should have compact support so that x ∈ C²(ℝ). Under these premises there exists a unique solution x, with x(−∞) = 0, for each input j ∈ A ⊂ C²(ℝ). Hence, the time course of the solution x is a function¹ of the time course of the perturbation j; formally,
F : A ⊂ C²(ℝ) → C²(ℝ), j ↦ F[j] = x with ẋ(t) − f(x(t)) = j(t), ∀ t ∈ ℝ.   (A.2)

Let us suppose that the function F is analytic in a neighborhood of the constant solution x = 0. For a precise definition of the notion of analyticity and conditions under which it holds, we refer to Thomas (1996). In the case of analyticity there is a Taylor series (Dieudonné, 1968; Dunford & Schwartz, 1958; Kolmogorov & Fomin, 1975; Hille & Phillips, 1957) for F,
F[j] = 0 + F′[0][j] + (1/2!) F″[0][j, j] + ⋯ .   (A.3)
This series is convergent with respect to the ‖·‖₂-norm; that is, F[j](t) = x(t) for almost all t ∈ ℝ. Each derivative F^(n)[0] in equation A.3 is a continuous n-linear mapping from (C²(ℝ))ⁿ to C²(ℝ). Hence, F^(n)[0][·](t) : (C²(ℝ))ⁿ → ℝ is a continuous n-linear functional for almost all (fixed) t ∈ ℝ. We know that every such functional has a Volterra-like integral representation (Volterra, 1959; Palm & Poggio, 1977; Palm, 1978),

F^(n)[0][j, …, j](t) = ∫ dt₁ ⋯ dtₙ ε_t^(n)(t₁, …, tₙ) j(t − t₁) ⋯ j(t − tₙ).   (A.4)
In the most general case the kernels ε_t^(n) are distributions. We will see, however, that the kernels describing equation A.2 are (almost everywhere) continuous functions on ℝⁿ. Combining equations A.3 and A.4 yields the Volterra series of F,
x(t) = F[j](t) = ∫ dt₁ ε^(1)(t₁) j(t − t₁) + (1/2!) ∫∫ dt₁ dt₂ ε^(2)(t₁, t₂) j(t − t₁) j(t − t₂) + ⋯ .   (A.5)

¹The bracket convention is such that (·) denotes the dependence on a real or complex scalar, and [·] denotes the dependence on a function.
Because we expand around the constant solution x = 0, the ε kernels depend not on the time t but on the difference (t − t′), t′ being the time at which we have stated the initial condition. In this subsection, we calculate the solution for the initial condition x(t′ = −∞) = 0, and the ε kernels do not depend on t at all. We have therefore dropped the index t from ε_t^(n) as it occurs in equation A.4. As is discussed in section A.3, dropping the index is possible only if no spike occurs in the past. We can easily generalize this formalism to systems of differential equations,

ẋ(t) = f(x(t)) + j(t),   (A.6)

with x and f now vector valued,
and obtain

x_k(t) = ∫ dt₁ ε_k^(1)(t₁) j(t − t₁) + (1/2!) ∫∫ dt₁ dt₂ ε_k^(2)(t₁, t₂) j(t − t₁) j(t − t₂) + ⋯ .   (A.7)
As opposed to the lower index t in equation A.4 and the lower index p in equation 2.3, the subscript k in equation A.7 denotes the kth vector component of the system throughout the rest of this appendix.

A.2 Calculation of the Kernels. We want to prove that the kernels ε_k^(1), ε_k^(2), … can be calculated explicitly for fairly arbitrary systems of differential equations. In particular, it is possible to obtain analytic expressions for the kernels in terms of the eigenvalues of the linearization of the N-dimensional system of differential equations we start with. Suppose f in equation A.6 is analytic in the neighborhood of x = 0 so that we can expand f in a Taylor series around x = 0. In doing so, we use the Einstein summation convention. We substitute the Taylor series into equation A.6,
realize that the derivatives are evaluated at x = 0, and switch to Fourier transforms,
(A.9)
The Fourier transform of the Volterra series (see equation A.7) is
We substitute this into equation A.9 and obtain a polynomial (in a convolution algebra) in terms of ĵ(ω),
Here we have defined the abbreviation

c_kl(ω) = iω δ_kl − (∂f_k/∂x_l)(0),   (A.12)

together with analogous abbreviations for the higher-order Taylor coefficients of f at x = 0.   (A.13)
Equation A.11 holds for every function j. The coefficients of ĵ(ω) therefore have to vanish identically, and we are left with a set of linear equations for the kernels, which can be solved consecutively,
(A.14)
Note that we have to invert only one matrix, c_kl(ω), in order to solve equation A.14 for all kernels. The inverse of c_kl(ω) is

(c⁻¹)_kl(ω) = C_kl(ω) / det c(ω),   (A.15)
where C_mn(ω) is the adjunct matrix of c_kl(ω) defined in equation A.12. Solving equation A.14, we obtain for ε_k^(n)(ω₁, …, ωₙ) an expression with a denominator that is a product of the characteristic polynomial p(ω) := det c_kl(ω), evaluated at different frequencies ω. For instance, solving equation A.14 for ε_k^(2)(ω₁, ω₂) yields a denominator p(ω₁) p(ω₂) p(ω₁ + ω₂), the denominator in ε_k^(3)(ω₁, ω₂, ω₃) is p(ω₁) p(ω₂) p(ω₃) p(ω₁ + ω₂) p(ω₁ + ω₃) p(ω₂ + ω₃) p(ω₁ + ω₂ + ω₃), and so forth. If we know the roots of p(ω), that is, the eigenvalues of the system of differential equations, we can factorize the denominator and easily determine the residues of ε_k^(n)(ω₁, …, ωₙ) and thus derive an analytic expression for the inverse Fourier transform of ε_k^(n)(ω₁, …, ωₙ).
The coefficients that appear in this expression are complex numbers given by sums and differences of the N eigenvalues λ_k of the N-dimensional system of differential equations.
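As a minimal numerical rendering of this eigenvalue argument, the Python sketch below computes the first-order kernel of a linearized system ẋ = Ax + e j(t), for which ε_k^(1)(t) = Θ(t) (e^{At} e)_k, as a sum of exponentials of the eigenvalues of A. The matrix A and input direction e stand in for the linearization of equation A.6 and are assumptions of the example, not quantities taken from the paper.

```python
import numpy as np

def first_order_kernel(A, e, k, t):
    """First-order Volterra kernel of the linear system dx/dt = A x + e j(t):
    eps_k^(1)(t) = Theta(t) * (exp(A t) e)_k, expanded over the eigenvalues
    lambda_m of A (assumed diagonalizable).  t is a 1-d array of times."""
    lam, V = np.linalg.eig(A)             # A = V diag(lam) V^{-1}
    c = V[k, :] * np.linalg.solve(V, e)   # residues of the k-th component
    t = np.asarray(t, dtype=float)
    modes = np.exp(np.outer(t, lam)) @ c  # sum_m c_m * exp(lambda_m * t)
    return np.where(t >= 0.0, np.real(modes), 0.0)
```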
The computational difficulties are thus reduced to the calculation of the N roots of a polynomial of degree N, where N is the system dimension; see equation A.6.

A.3 Limitations. In most cases, the function F is not an entire function; in other words, the series expansion around the constant solution x = 0 does not converge for all perturbations j (Thomas, 1996). For the Hodgkin-Huxley equations, we have found numerically that the truncated series expansion gives a fine approximation to the true solution only as long as the input current stays well below the spiking threshold. This is not too surprising, since the solution of the Hodgkin-Huxley equations exhibits an apparent discontinuity at the spiking threshold. Consider a family of input functions given by
j_s : t ↦ j₀ Θ(t + s) Θ(s − t).   (A.18)
These are square pulses with height j₀ and duration 2s. There is a threshold j_ϑ(s) and a small constant δ with δ/j_ϑ(s) ≪ 1 so that for j₀ < j_ϑ(s) − δ
no spike occurs, whereas for j₀ > j_ϑ(s) + δ at least one spike is triggered within a given compact interval (see Figure 2). Thus, the two-norm of the derivative, ‖F′[j_s]‖₂ ≈ ‖F[j_s + δ] − F[j_s − δ]‖₂ / (2δ), takes large values as j₀ approaches the value j_ϑ(s), and we do not expect the series expansion to converge beyond that point.

In the body of this article, we introduced the response function η in order to address this problem. The key advantage of η₁ and the "improved" kernels ε₁^(1), ε₁^(2), … is that we no longer expand around the constant solution x = 0 but around a solution of the Hodgkin-Huxley equations that contains a single spike at t = t^f. In the context of equation A.5, we have argued that the ε kernels do not depend on the absolute time. However, since the new zero-order approximation u(t) = η₁(t − t^f) contains a spike, homogeneity of time is destroyed, and the improved ε kernels depend on (t − t^f), where t^f is the latest firing time. Unfortunately, there is no analytic solution of the Hodgkin-Huxley equations with spikes. We therefore have to, and did, resort to numerical methods in order to determine the kernels ε₁^(n) in the neighborhood of a spike.

In order to construct approximations of solutions with more than one spike, we have to concatenate these single-spike approximations. We can then exploit the fact that the truncated series expansion exhibits rather prominent peaks in the membrane potential at positions where the Hodgkin-Huxley equations would produce a spike. The task is to decide from the truncated series expansion which of the peaks in the approximated membrane potential belong to a spike and which do not. In the main body of the article, we investigated how well this can be done by using a threshold criterion.
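The following Python sketch mimics this concatenation-plus-threshold procedure under several simplifying assumptions of our own: the kernels η₁ and ε₁^(1) are available as callables (with η₁(s) → 0 for large s and ε₁^(1) vectorized over its second argument), only the first-order term of the expansion is kept, times are in milliseconds, and the 2 ms re-trigger guard is an arbitrary choice. It is a sketch of the criterion described above, not the authors' actual code.

```python
import numpy as np

def predict_spikes(j, dt, eta1, eps1, theta, delta):
    """Threshold criterion on the truncated single-spike expansion:
    u(t) = eta1(t - tf) + sum_s eps1(t - tf, s) * j(t - s) * dt, with tf
    the latest firing time; a spike maximum is placed delta after each
    threshold crossing of u."""
    n = len(j)
    s = np.arange(n) * dt          # lags 0, dt, 2*dt, ... (ms)
    tf = -1.0e9                    # "no spike in the past"
    spike_times = []
    for i in range(n):
        t = i * dt
        conv = dt * np.dot(eps1(t - tf, s[:i + 1]), j[i::-1])
        u = eta1(t - tf) + conv
        if u >= theta and t - tf > 2.0:    # crude 2 ms re-trigger guard
            spike_times.append(t + delta)  # maximum follows crossing by delta
            tf = t
    return spike_times
```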
Acknowledgments

This work was supported by the Deutsche Forschungsgemeinschaft (DFG) under grant numbers He 1729/2-2 and 8-1.

References

Abbott, L. F., & Kepler, T. B. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. In L. Garrido (Ed.), Statistical Mechanics of Neural Networks. Berlin: Springer-Verlag.
Abbott, L. F., & Vreeswijk, C. van. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E 48:1483-1490.
Av-Ron, E., Parnas, H., & Segel, L. A. (1991). A minimal biophysical model for an excitable and oscillatory neuron. Biol. Cybern. 65:487-500.
Cole, K. S., Guttman, R., & Bezanilla, F. (1970). Nerve membrane excitation without threshold. Proc. Natl. Acad. Sci. 65(4):884-891.
Cronin, J. (1987). Mathematical aspects of Hodgkin-Huxley theory. Cambridge: Cambridge University Press.
Dieudonné, J. (1968). Foundations of modern analysis. New York: Academic Press.
Dunford, N., & Schwartz, J. T. (1958). Linear operators, Part I: General theory. New York: Wiley.
Ekeberg, O., Wallen, P., Lansner, A., Traven, H., Brodin, L., & Grillner, S. (1991). A computer based model for realistic simulations of neural networks. Biol. Cybern. 65:81-90.
Feller, W. (1970). An introduction to probability theory and its applications (3rd ed., Vol. 1). New York: Wiley.
FitzHugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J. 1:445-466.
Gerstner, W. (1991). Associative memory in a network of "biological" neurons. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3 (pp. 84-90). San Mateo, CA: Morgan Kaufmann.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E 51:738-758.
Gerstner, W., & Hemmen, J. L. van. (1992). Associative memory in a network of "spiking" neurons. Network 3:139-164.
Gerstner, W., & Hemmen, J. L. van. (1993). Coherence and incoherence in a globally coupled ensemble of pulse emitting units. Phys. Rev. Lett. 71(3):312-315.
Gerstner, W., Hemmen, J. L. van, & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comp. 8:1689-1712.
Hille, E., & Phillips, R. S. (1957). Functional analysis and semi-groups. Providence, RI: American Mathematical Society.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of ion currents and its applications to conduction and excitation in nerve membranes. J. Physiol. (London) 117:500-544.
Hopfield, J. J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Towards computation with coupled integrate-and-fire networks. Proc. Natl. Acad. Sci. USA 92:6655.
Joeken, S., & Schwegler, H. (1995). Predicting spike train responses of neuron models. In M. Verleysen (Ed.), Proc. 3rd European Symposium on Artificial Neural Networks (pp. 93-98). Brussels: D facto.
Kepler, T. B., Abbott, L. F., & Marder, E. (1992). Reduction of conductance-based neuron models. Biol. Cybern. 66:381-387.
Kernell, D., & Sjoholm, H. (1973). Repetitive impulse firing: Comparison between neuron models based on voltage clamp equations and spinal motoneurons. Acta Physiol. Scand. 87:40-56.
Koch, C., Bernander, O., & Douglas, R. J. (1995). Do neurons have a voltage or a current threshold for action potential initiation? J. Comp. Neurosci. 2:63-82.
Kolmogorov, A. N., & Fomin, S. V. (1975). Reelle Funktionen und Funktionalanalysis. Berlin: VEB Deutscher Verlag der Wissenschaften.
Korenberg, M. J. (1989). A robust orthogonal algorithm for system identification and time-series analysis. Biol. Cybern. 60:267-276.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation electrique des nerfs traitee comme une polarisation. J. Physiol. Pathol. Gen. 9:620-635.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science 268:1503-1506.
Mauro, A., Conti, F., Dodge, F., & Schor, R. (1970). Subthreshold behavior and phenomenological impedance of the giant squid axon. J. Gen. Physiol. 55:497-523.
Palm, G. (1978). On representation and approximation of nonlinear systems. Biol. Cybern. 31:119-124.
Palm, G., & Poggio, T. (1977). The Volterra representation and the Wiener expansion: Validity and pitfalls. SIAM J. Appl. Math. 33(2):195-216.
Palm, G., & Poggio, T. (1978). Stochastic identification methods for nonlinear systems: An extension of the Wiener theory. SIAM J. Appl. Math. 34(3):524-534.
Prohorov, Yu. V., & Rozanov, Yu. A. (1969). Probability. Berlin: Springer-Verlag.
Rinzel, J. (1985). Excitation dynamics: Insights from simplified membrane models. Federation Proc. 44:2944-2946.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neuronal excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling. Cambridge, MA: MIT Press.
Thomas, E. G. F. (1996). Q-analytic solution operators for non-linear differential equations (Preprint). Department of Mathematics, University of Groningen.
Traub, R. D., Wong, R. K. S., Miles, R., & Michelson, H. (1991). A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol. 66:635-650.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett. 72:1281-1283.
Usher, M., Schuster, H. G., & Niebur, E. (1993). Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Computation 5:570-586.
Volterra, V. (1959). Theory of functionals and of integral and integro-differential equations. New York: Dover.
Wiener, N. (1958). Nonlinear problems in random theory. Cambridge, MA: MIT Press.
Wilson, M. A., Bhalla, U. S., Uhley, J. D., & Bower, J. M. (1989). Genesis: A system for simulating neural networks. In D. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 485-492). San Mateo, CA: Morgan Kaufmann.
Yamada, W. M., Koch, C., & Adams, P. R. (1989). Multiple channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling: From Synapses to Networks. Cambridge, MA: MIT Press.

Received April 17, 1996; accepted October 28, 1996.
Communicated by Bruce Knight
Noise Adaptation in Integrate-and-Fire Neurons

Michael E. Rudd
Lawrence G. Brown
Department of Psychology, Johns Hopkins University, Baltimore, Maryland 21218, U.S.A.
The statistical spiking response of an ensemble of identically prepared stochastic integrate-and-fire neurons to a rectangular input current plus gaussian white noise is analyzed. It is shown that, on average, integrate-and-fire neurons adapt to the root-mean-square noise level of their input. This phenomenon is referred to as noise adaptation. Noise adaptation is characterized by a decrease in the average neural firing rate and an accompanying decrease in the average value of the generator potential, both of which can be attributed to noise-induced resets of the generator potential mediated by the integrate-and-fire mechanism. A quantitative theory of noise adaptation in stochastic integrate-and-fire neurons is developed. It is shown that integrate-and-fire neurons, on average, produce transient spiking activity whenever there is an increase in the level of their input noise. This transient noise response is either reduced or eliminated over time, depending on the parameters of the model neuron. Analytical methods are used to prove that nonleaky integrate-and-fire neurons totally adapt to any constant input noise level, in the sense that their asymptotic spiking rates are independent of the magnitude of their input noise. For leaky integrate-and-fire neurons, the long-run noise adaptation is not total, but the response to noise is partially eliminated. Expressions for the probability density function of the generator potential and the first two moments of the potential distribution are derived for the particular case of a nonleaky neuron driven by gaussian white noise of mean zero and constant variance. The functional significance of noise adaptation for the performance of networks comprising integrate-and-fire neurons is discussed.

1 Introduction

In this article, we analyze the statistical spiking behavior of stochastic integrate-and-fire neurons, that is, integrate-and-fire neurons that receive probabilistic input. We consider the rather general case in which the random fluctuations in the neural input are modeled as gaussian white noise. We show that neurons whose range of allowable generator potential states is unbounded in the negative domain exhibit a dynamic and probabilistic

Neural Computation 9, 1047–1069 (1997)
© 1997 Massachusetts Institute of Technology
adaptation to the root-mean-square noise level of their input. This suggests a technique by which noise suppression can be carried out in artificial neural networks consisting of such neurons and raises the question of whether a similar noise adaptation mechanism exists in actual biological neural networks.

The stochastic integrate-and-fire model of neural spike generation was introduced about thirty years ago (Gerstein, 1962; Gerstein & Mandelbrot, 1964) and has since been the subject of numerous theoretical papers (e.g., Calvin & Stevens, 1965; Capocelli & Ricciardi, 1971; Geisler & Goldberg, 1966; Gluss, 1967; Johannesma, 1968; Stein, 1965, 1967) and reviews (Holden, 1976; Ricciardi, 1977; Tuckwell, 1988, 1989). Because the spike generation model is not a new idea, many users of the model probably assume that its properties are well, if not completely, understood. This is far from the case. Analytical results have generally proved difficult to obtain, and the main results in the literature consist of expressions for the density function and moments of the spike interarrival times of the neuron (Holden, 1976; Ricciardi, 1977; Tuckwell, 1988, 1989). In fact, not even a closed-form solution for the interarrival time density is known for the biologically important case of an integrate-and-fire model that includes a membrane decay (Tuckwell, 1988, 1989).

Some behavioral properties of the stochastic integrate-and-fire neuron, such as the firing rate, are most naturally defined in terms of the statistics of an ensemble of identically prepared neurons. For example, in single-cell recording studies of individual neurons, the firing rate dynamics are often illustrated by a response histogram, which is a tabulation of many probabilistic spiking events as a function of the time after stimulus onset. The response histogram represents a statistical sample of spike counts taken from an ensemble of identically prepared neurons. In this article, we present some new analytic results that add considerable insight into the theoretical ensemble spiking statistics of integrate-and-fire neurons. Specifically, we consider the case of an integrate-and-fire neuron whose generator potential is unbounded in the negative domain, and derive an expression for the time-dependent average firing rate of the neuron, assuming an input consisting of a rectangular current plus gaussian white noise. We demonstrate that the neuron has the surprising property of adapting, on average, to the level of the noise in its input. We then analyze the evolution of the probabilistic state of the generator potential to discover the mechanism behind this noise adaptation.

2 The Stochastic Integrate-and-Fire Spike Generation Model

Let U(t) stand for the generator potential of the integrate-and-fire neuron. According to the standard mathematical formalization of integrate-and-fire neurons (Holden, 1976; Ricciardi, 1977; Tuckwell, 1988, 1989), we can write a stochastic differential equation that describes the time evolution
of the generator potential. In each small time step, the potential U(t) is increased by an amount that depends on the mean input rate µ(t) and the duration of the step dt. The potential is also perturbed by gaussian white noise with variance per unit time σ²(t). Finally, a percentage α dt of the generator potential decays away as a result of a membrane leak. Thus, the generator potential obeys the stochastic differential equation

dU(t) = (−αU(t) + µ(t)) dt + σ(t) dW(t),
(2.1)
where W(t) is a standard Wiener process with mean zero and unit variance per unit time. As is well known, equation 2.1 describes the evolution of a stochastic diffusion process, known in the mathematical literature as the Ornstein-Uhlenbeck process (OUP) (Uhlenbeck & Ornstein, 1930; Tuckwell, 1988, 1989). The OUP is a type of transformed Brownian motion process. Whenever the generator potential of the integrate-and-fire neuron exceeds the value of a neural firing threshold θ, the neuron emits a single neural pulse, and the generator potential is reset to zero. The neural output signal y(t) thus evolves according to the following formula (Chapeau-Blondeau, Godivier, & Chambet, 1996): If U(t) = θ then y(t) = δ(t0 − t), U(t) = 0; else y(t) = 0.
(2.2)
Equations 2.1 and 2.2 represent the standard formulation of the integrateand-fire model. Note, however, that the effect of the resets on the probabilistic evolution of the generator potential is not explicitly taken into account in the standard formulation of the problem (see equation 2.1). Technically, equation 2.1 describes only the evolution of the generator potential between successive spikes. To develop a probabilistic model of the generator potential that takes the resets into account and is therefore not limited to time intervals in which no spikes occur, we modify equation 2.1 as follows: dV(t) = (−αV(t) + µ(t))dt + σ (t)dW(t) − θ dN(t),
(2.3)
where dN(t) is the number of resets in the interval dt; we now denote the potential by a new symbol, V(t), in order to distinguish clearly the behavior of the generator potential with resets (see equation 2.3) from the behavior of the OUP U(t) (see equation 2.1). To complete our formalization of the stochastic integrate-and-fire neuron, we may then write If V(t) = θ then y(t) = δ(t0 − t), V(t) = 0; else y(t) = 0.
(2.4)
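A direct way to explore equations 2.3 and 2.4 numerically is Euler-Maruyama integration of the generator potential with resets. The Python sketch below is a minimal rendering under our own choices of time step, horizon, and noise seed; the σ in the usage line is an arbitrary illustrative value, since the paper specifies the noise levels only graphically.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_if(mu, sigma, theta=50.0, alpha=0.0, V0=0.0, T=2.0, dt=1e-4):
    """Euler-Maruyama integration of equation 2.3: (leaky) integrate-and-fire
    generator potential with threshold resets.  Returns the spike times."""
    n = int(T / dt)
    V = V0
    spikes = []
    for i in range(n):
        dW = rng.normal(0.0, np.sqrt(dt))         # Wiener increment
        V += (-alpha * V + mu) * dt + sigma * dW  # drift plus noise
        if V >= theta:                            # equation 2.4: emit a spike
            spikes.append(i * dt)
            V = 0.0                               # reset term, -theta dN(t)
    return np.array(spikes)

# e.g., the nonleaky case of Figure 1a: the long-run rate approaches mu/theta
rate = len(simulate_if(mu=150.0, sigma=25.0)) / 2.0
```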
Note that explicit consideration of the resets on the generator potential results in the appearance of an inhibitory term in the stochastic equation 2.3, for the time evolution of the potential that is not present in the standard equation for the generator potential (see equation 2.1). In fact, the only difference between equations 2.1 and 2.3 is the presence of an extra inhibitory
term in equation 2.3 that arises from the operation of the reset mechanism of the integrate-and-fire neuron. The inhibition resulting from the neural resets tends to reduce the potential compared to the hypothetical potential that would be produced in the absence of resets. Thus, the resets tend to hyperpolarize the neuron. We will show that this reset-induced hyperpolarization has important implications for the dynamics of the mean spiking rate of the stochastic integrate-and-fire neuron. Informal reasoning suggests that as the potential is progressively decreased by a series of resets while the threshold remains fixed, the neuron will fire less; thus, the reset mechanism will act to decrease the firing rate progressively until the separate influences that act to increase (input current) and decrease (resets) the generator potential exactly counterbalance one another. Once the effects of these opposing influences equilibrate, the firing rate will stabilize. This reasoning is directly checked in the next section by computing the average spiking rate of the neuron as a function of time.

3 Firing Rate Dynamics

In this section, we investigate the spiking response of the neuron to an input of constant mean rate µ and diffusion coefficient σ² that is turned on at t = 0. The neuron is initially at potential V(0). At time zero it is suddenly perturbed by a rectangular input current, to which is added gaussian white noise. For purposes of exposition, we will momentarily assume that the neuron has no membrane leak. In this special case, we can derive an expression for the time-dependent neural firing rate by using the well-known density function for the first passage time of a Brownian motion initially localized at the point (V(0), 0) to cross an arbitrary voltage level V_a, which is one of the few analytic expressions that we have in our arsenal for understanding the behavior of the neuron. The density function is
f(t) = (V_a − V(0)) / (σ √(2π t³)) · exp[−(V_a − V(0) − µt)² / (2σ²t)]   (3.1)
(Karlin & Taylor, 1975). From this density we immediately obtain the first firing time density:

f₁(t) = (θ − V(0)) / (σ √(2π t³)) · exp[−(θ − V(0) − µt)² / (2σ²t)].   (3.2)
More generally, it can be shown (Rudd, 1996) that the density for the time of the Nth neural spike is

f_N(t) = (Nθ − V(0)) / (σ √(2π t³)) · exp[−(Nθ − V(0) − µt)² / (2σ²t)].   (3.3)
It is easily seen that the average spiking rate is given by the infinite sum

dN(t)/dt = Σ_{N=1}^∞ f_N(t).   (3.4)
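Numerically, the series in equation 3.4 can be truncated once the omitted densities are negligible on the time window of interest. A minimal Python sketch, with the truncation point N_max chosen by us for illustration:

```python
import numpy as np

def f_N(t, N, mu, sigma, theta=50.0, V0=0.0):
    """Density of the N-th spike time, equation 3.3 (nonleaky neuron, t > 0)."""
    a = N * theta - V0
    return a / (sigma * np.sqrt(2.0 * np.pi * t**3)) * \
        np.exp(-(a - mu * t)**2 / (2.0 * sigma**2 * t))

def firing_rate(t, mu, sigma, theta=50.0, V0=0.0, N_max=200):
    """Average firing rate dN/dt of equation 3.4, truncated at N_max terms."""
    return sum(f_N(t, N, mu, sigma, theta, V0) for N in range(1, N_max + 1))
```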
In Figure 1a, we plot the infinite series (3.4) for the parameter set V(0) = 0, µ = 150, θ = 50, and three different values of σ. Note that the firing rate has both a transient noise-dependent component and a sustained noise-independent component. The initial, transient, firing rate is larger for larger values of the input noise. However, in the long run (as t → ∞), as indicated on the plot, dN(t)/dt → µ/θ, a value independent of σ. The asymptotic (sustained) firing rate can be derived by using a mathematical theorem (Parzen, 1962, p. 180), which relies on the fact that the number of spikes N(t) generated in the interval [0, t] is a renewal counting process. We first note that the average value of the random time ξ for the potential initialized at zero to cross the neural threshold is ξ̄ = θ/µ (Rudd, 1996) and then apply the theorem:

lim_{t→∞} N(t)/t = 1/ξ̄.   (3.5)
Multiplying both sides of equation 3.5 by t and taking the time derivative yields dN(t)/dt → ξ̄⁻¹ as t → ∞. In the specific case of the nonleaky integrate-and-fire model, dN(t)/dt → µ/θ as t → ∞. In what follows, we will refer to equation 3.5 as the renewal counting theorem.

When µ = 0, ξ̄ is infinite, and the renewal counting theorem therefore does not apply. However, as µ → 0, dN(t)/dt → 0 as t → ∞, which suggests that the firing rate will tend to zero in the long run when the neuron is driven by gaussian white noise of constant variance and mean zero. The same conclusion is reached by noting that for µ = 0 and as t → ∞, every term in our infinite series expression (3.4) for the neural firing rate goes to zero. We further verify the conclusion by plotting in Figure 1b the spiking rate of a nonleaky integrate-and-fire neuron with µ = 0 driven by gaussian white noise of three different magnitudes.

Our results to this point indicate that when an integrate-and-fire neuron is suddenly perturbed by a noisy input, the neuron produces a transient spiking response that is positively related to the noise fluctuation level; in the long run, the response to noise completely dies away, and the mean firing rate depends only on the deterministic component of the input. In the next section we will show that the tendency of the neuron's response to noise to decay over time represents a form of stochastic neural adaptation. The noise response is eliminated by the reset-induced hyperpolarization of the generator potential. We refer to this phenomenon as noise adaptation. The word adaptation connotes both the idea of dynamic change and the idea that
Figure 1: Average firing rate of the integrate-and-fire neuron as a function of time, plotted for three different values of the generator potential noise level σ. (a) Average firing rate of a nonleaky neuron with mean input rate µ = 150, calculated on the basis of equation 3.4. (b) Average firing rate of a nonleaky neuron with mean input rate µ = 0, calculated on the basis of equation 3.4. (c) Simulated firing rate of a leaky integrate-and-fire neuron (α = 3.33) driven at a constant mean input rate (µ = 150). Additional model parameters in (a), (b), and (c): θ = 50; V(0) = 0. Time in arbitrary units.
the change may serve an adaptive function. To the extent that the function of an individual neuron within a network is to perform a calculation based on a weighted sum of its deterministic input currents, the noise response is spurious, and noise adaptation serves the purpose of reducing the number of noise-induced spikes (i.e., reducing the number of false positives).

Before we present a mathematical proof that the dynamic firing rate reduction represents an adaptation to noise, we will briefly return to the problem of investigating the average firing rate dynamics of integrate-and-fire neurons. So far, we have derived an expression only for the average firing rate of an integrate-and-fire neuron that has no membrane leak. Ideally, we would like to find a general expression for the average firing rate that would apply to leaky integrate-and-fire neurons as well as nonleaky ones. Despite considerable effort, however, we have not been able to obtain such an expression. We therefore rely on computer simulation to analyze the properties of the leaky neuron in order to check the generality of our conclusions.

In Figure 1c, we plot the firing rate of a simulated leaky integrate-and-fire neuron (α = 3.33) for a constant positive mean input (µ = 150) and three different levels of input noise (indicated on the plot). As in the case of the nonleaky neuron, the leaky neuron produces an initial transient response to noise, the magnitude of which is positively related to the input noise level. The transient response is followed by a decline in the firing rate while the input remains constant. Since the neural response decreases dynamically while the input remains constant, it appears that some form of neural adaptation is taking place. In fact, this state of affairs might be taken as a definition of adaptation.

The renewal counting theorem, equation 3.5, can be used, together with any of several exact or approximate expressions for the mean firing time of the leaky-integrator neuron (e.g., Roy & Smith, 1969; Thomas, 1975; Ricciardi & Sacerdote, 1979; Wan & Tuckwell, 1982; Tuckwell, 1989), to derive an expression for the long-run value of the average firing rate of the neuron. Wan and Tuckwell (1982) obtained several approximate expressions for the mean firing time and noted that noise always tends to decrease the average spike interarrival time of the leaky integrator neuron (regardless of whether the mean input rate is positive or zero). From this fact and equation 3.5, it follows that the sustained firing rate of the neuron with a membrane leak increases with the level of the input noise. The simulated firing rates plotted in Figure 1c are consistent with this conclusion. Thus, we conclude that the response to noise is not completely eliminated over time in the leaky neuron. However, our firing rate simulations indicate that a greater amount of noise-response reduction (transient response minus sustained response) is observed at higher input noise levels.

The overall pattern of results indicates that the dynamic reduction in the average neural firing rate acts to reduce the level of the transient noise response in the leaky neuron, with greater compensation occurring at higher
noise levels. This pattern is consistent with the interpretation that leaky integrate-and-fire neurons adapt to reduce their response to noise. The dynamic reduction in the noise response appears to be a general feature of integrate-and-fire neurons, with or without a membrane leak, although the noise dependence of the average firing rate is asymptotically eliminated only in the case of the nonleaky neuron.

4 Generator Potential Dynamics

In order to understand better the mechanism behind the dynamical changes in the neural firing rate, we now return to our analysis of the time-dependent behavior of the generator potential. Setting α = µ(t) = 0 and σ(t) = σ, a constant, in equation 2.3, taking the ensemble averages of the quantities on both sides of the equation, and time-integrating yields
(4.1)
where V(0) is the value of the generator potential at time zero, and N(t) is the average number of resets in the interval [0, t]. Since µ(t) = 0, the neuron is driven by gaussian white noise of mean zero and neural spikes are generated solely on the basis of noise fluctuations. According to equation 4.1, the reset mechanism acts in this special case to decrease progressively the mean value of the generator potential, and since the mean input rate is zero, there is no counteracting tendency for the average potential to be increased by the input. With each additional reset, the average value of the generator potential is further decreased. We anticipate that the reduction in the mean level of the potential will be associated with a concomitant reduction in the neural firing rate. The generator potential will continue to be decreased until the cell eventually stops firing altogether. This anticipated behavior is consistent with our previous result based on the renewal counting theorem, which indicated that a nonleaky neuron driven by noise alone turns itself off in the long run. Without the insight gained from a stochastic analysis of the neuron’s behavior, an empirical observation of the neuron’s behavior might lead to the mistaken conclusion that the firing rate is subject to negative feedback. Further insight can be gained by averaging both sides of equation 2.3 (with α = µ(t) = 0 and σ (t) = σ , a constant) and dividing by dt to obtain: dN(t) dV(t) = −θ . dt dt
(4.2)
Since, as proved in the previous section, the average firing rate goes to zero as t → ∞, the time derivative of the average value of the generator potential also must go to zero in the same limit.
Noise Adaptation
1055
This approach is easily generalized to the case of an integrate-and-fire neuron with arbitrary parameters. In the most general case, dN(t) dV(t) = −αV(t) + µ(t) − θ . dt dt
(4.3)
In the last section it was shown that the firing rate tends to a constant as t → ∞. In that limit, dV(t)/dt → 0. By making this substitution in equation 4.3, we obtain the asymptotic firing rate µ − αV(∞) dN(∞) = , dt θ
(4.4)
where we have assumed µ(t) to be constant. For α = 0 (nonleaky neuron), dN(t)/dt → µ/θ as t → ∞, which is consistent with our earlier result based on the renewal counting theorem. For arbitrary α, equation 4.4 can be rewritten in the form θ V(∞) = α
Ã
! dN(∞) µ− . dt
(4.5)
The asymptotic level of the mean generator potential can be obtained from equation 4.5 by using the renewal counting theorem, equation 3.5, and any of several expressions for the mean firing time (Roy & Smith, 1969; Thomas, 1975; Ricciardi & Sacerdote, 1979; Wan & Tuckwell, 1982; Tuckwell, 1989) to compute the asymptotic average firing rate of the leaky neuron. For a leaky neuron driven by gaussian white noise of mean zero, the asymptotic value of V(t) will be negative and of larger absolute value at higher noise levels, since the asymptotic firing rate is larger for larger values of σ . Again assuming that α = 0 and µ(t) constant, we next substitute equation 3.4 into equation 4.3 to obtain ∞ X dV(t) =µ−θ fN (t) dt N=1
(4.6)
and V(t) = V(0) + µt − θ
Z tX ∞ 0 N=1
fN (s) ds.
(4.7)
We then rearrange equation 4.7 to obtain "Z V(0) − V(t) = θ
∞ tX
0 N=1
fN (s) ds −
³µ´ θ
# t .
(4.8)
The time integral on the right-hand side of equation 4.8 is the average number of spikes emitted up to time t. The quantity in parentheses is the average firing rate of an integrate-and-fire neuron with no input noise. Therefore, the quantity in brackets is the difference between the number of spikes emitted in the interval [0, t] with and without input noise. Equation 4.8 indicates that the amount by which the neural potential at time t is decreased, or hyperpolarized, relative to the resting potential is equal to the product of the threshold and the number of "extra" spikes generated when gaussian white noise is added to the deterministic input current. We thus conclude that the generator potential is hyperpolarized, on average, by an amount that depends only on the level of the noise, not on the average input level. This result supports our claim that the adaptation observed in the average spiking rate of the stochastic integrate-and-fire neuron is an adaptation to noise.

Further knowledge of the dynamic behavior of V(t) can be gained by reasoning qualitatively from equation 4.6 while simultaneously recalling the dynamic behavior of the infinite sum (i.e., the average firing rate) illustrated in Figure 1a. At time zero, the average firing rate of the neuron is zero, so the rate of change in V(t) must initially be positive, and V(t) itself must initially increase. In order for the time derivative of the mean generator potential to change signs, its value must pass through zero. This occurs when the average firing rate is µ/θ. The average firing rate first increases with time, then decreases asymptotically to the value µ/θ. It attains the value µ/θ at only one finite time, which occurs during the rising portion of the firing rate function. Let us call this time t₀. It follows that V(t) first increases from the value V(0) during the interval [0, t₀), then decreases during the interval (t₀, ∞).

We verified these conclusions by plotting equation 4.7 as a function of time in Figure 2. Values of V(t) were computed numerically with V(0) = 0. Figure 2a illustrates the initial behavior of V(t) examined over a brief period. The plot indicates that V(t) first increases and thereafter monotonically decreases, as anticipated by the preceding analysis. We also verified by plotting (not shown) the average firing rate of a nonleaky neuron with the same parameters that the turnaround occurs during the initial rising phase of the neural firing rate, when the firing rate first attains the level µ/θ. Figure 2b illustrates the behavior of V(t) on a larger time scale. The initial rise in V(t) and subsequent turnaround are not obvious at this scale, but the tendency for V(t) to decrease monotonically at large adaptation times is clear.

When µ = 0 (nonleaky neuron driven by gaussian white noise alone), V(t) is a monotonically decreasing function of time. In that case, the infinite sum in equation 4.6 goes to zero as t → ∞, so the time derivative of the average potential must also asymptotically go to zero. The integral in equation 4.7 diverges for large t (since it is the sum of an infinite number of density functions and the time integral of each goes to one for t large). Thus, when µ = 0, V(t) → −∞ as t → ∞. After a long period of adaptation to gaussian white noise of mean zero,
Figure 2: Time dependence of the mean generator potential of a nonleaky integrate-and-fire neuron, computed on the basis of equation 4.7. (a) The initial behavior of V(t), illustrated on a short time scale. V(t) first rises, then monotonically decreases with time. (b) When viewed over a longer adaptation time, including the brief initial period illustrated in (a), a deceleration of the final monotonic decrease in V(t) is apparent. Model parameters: θ = 1; α = 0; µ = 2 × 10−4 ; V(0) = 0. Time in arbitrary units.
Figure 3: Average level of the generator potential of a nonleaky integrate-and-fire neuron driven by gaussian white noise with mean zero and constant variance, as a function of the standard deviation of the time-integrated noise in the Brownian motion process that drives the neuron. Additional model parameters: θ = 50; α = 0; µ = 0; V(0) = 0.
the generator potential will thus, on average, be large and negative. An informal argument suggested by a reviewer allows us to use this fact to find the approximate functional dependence of the average potential on the variables σ and t. At large negative values along the potential continuum, the thresholded Brownian motion will appear identical to a regular Brownian motion with a downward-reflecting barrier imposed at V(0). The average value of such a "reflected Brownian motion" is proportional to the standard deviation σ√t of an unrestricted Brownian motion with no absorbing or reflecting barriers. It follows that, for the integrate-and-fire neuron driven by noise alone, V(t) must be approximately proportional to σ√t for sufficiently large values of σ√t. We verified this informal reasoning by evaluating equation 4.7 numerically and plotting the function versus σ√t in Figure 3. As anticipated, the average value of the generator potential is a linearly decreasing function of the time-dependent standard deviation of the original Brownian motion process for t sufficiently large. This result further supports our claim that the integrate-and-fire neuron undergoes noise adaptation and quantifies the dependence of this noise adaptation on σ and t in the particular case in which α = µ(t) = 0.

The approximate equivalence between the probabilistic path structure of the generator potential of an integrate-and-fire neuron driven by noise alone and that of a reflected Brownian motion generalizes to the case of the leaky neuron, provided that the noise level is sufficiently large. In this case, the density function of the generator potential is approximately equivalent
to that of a reflected OUP at large t. In support of this claim, we note that the equivalence holds for large, negative values of V(t) and that, according to equation 4.5 (with µ = 0), large, negative values will be obtained at large adaptation times, provided that the asymptotic firing rate is sufficiently large. Recall that the asymptotic firing rate is an increasing function of the noise level; therefore, the equivalence must hold for sufficiently large values of σ. It follows that V(t) will be approximately equivalent to the average level of an OUP reflected at the origin when V(t) is large and negative. The average value of the reflected OUP is proportional to the standard deviation of the regular, unrestricted OUP (Tuckwell, 1988, 1989):

√(σ²(1 − e^(−2αt))/(2α)).

We have not been able to derive an expression for the probability density function of the generator potential in the most general case in which none of the parameters of the model is fixed; however, in the appendix, we derive expressions for the density function and the first two moments of the generator potential in the special case in which α = µ(t) = 0 and σ(t) = σ, a constant. There, the first moment is obtained by a method different from the one that led to equation 4.7.

5 Summary

We have investigated the effect of the reset mechanism of the stochastic integrate-and-fire neuron on the dynamics of both the average neural spiking rate and the statistics of the generator potential. The response of the neuron to a constant input with additive gaussian white noise was investigated in detail, under the assumption that the generator potential is unbounded in the negative range. By a combination of analytic and simulation methods, we showed that both leaky and nonleaky stochastic integrate-and-fire neurons produce a transient spiking response, the magnitude of which is positively related to the root-mean-square noise level of the input. In the specific case of the nonleaky neuron, it was proved that the noise response is dynamically and completely suppressed by a depression of the average level of the generator potential (i.e., hyperpolarization of the neuron) resulting from the neural reset mechanism. In a neuron with a membrane leak, the noise response is not completely suppressed by the reset mechanism, but it is nevertheless dynamically reduced by it. The reduction in the firing rate (measured by the difference between the peak transient and asymptotic spiking rates) is an increasing function of the input noise. We conclude that all integrate-and-fire neurons, leaky or not, for which the generator potential is unbounded in the negative domain adapt to the noise in their input; in the long run the adaptation is complete only in the case of the nonleaky neuron.
We derived a differential equation (see equation 4.3) that relates the rate of change in the mean level of the generator potential to the average neural firing rate. In the case of the nonleaky neuron, the average firing rate can be computed from an infinite series expression, and equation 4.3 can then be used to write an expression (see equation 4.6) for the time derivative of the mean value of the generator potential in terms of this series. We used this fact to study qualitatively the dynamic behavior of the mean generator potential. When the nonleaky neuron is driven by an input having a positive mean rate, the average potential first increases, then decreases. When the neuron is driven by gaussian white noise of mean zero, the average potential decreases over time monotonically, asymptotically tending to minus infinity in the limit of large t. For sufficiently large t, the average generator potential level in a nonleaky neuron driven by gaussian white noise of mean zero is a linearly decreasing function of the standard deviation of the time-integrated noise input. We presented an argument suggesting that this should also be the case in the model that includes a membrane leak.

In the specific case of the nonleaky neuron, we showed that the average amount of decrease in the generator potential (hyperpolarization) over the interval [0, t] is directly related to the average number of extra spikes generated during that interval over and above those that would be generated by a deterministic input current of the same intensity. In other words, the generator potential is inhibited on average by an amount that depends only on the number of neural spikes that can be attributed to noise, which increases with increasing noise magnitude. It follows that the dynamic reduction in the neural firing rate similarly depends only on the magnitude of the noise fluctuations in the input. This analysis supports our claim that the dynamic change in the average neural firing rate of an integrate-and-fire neuron in response to a stochastic input with constant parameters represents noise adaptation.

6 General Conclusions

The spike generation mechanism of stochastic integrate-and-fire neurons introduces an important nonlinearity into the operation of any circuit constructed of these units. In general, the consequences of this nonlinearity for the behavior of neural networks are not well understood. We have shown that the threshold nonlinearity produces a probabilistic neural adaptation that counteracts the tendency for an integrate-and-fire neuron to spike in response to noise fluctuations in its generator potential. This property of integrate-and-fire neurons might be useful in both artificial and biological neural networks, where it could function to reduce the number of false-positive responses of individual neurons within the network. We note in particular that noise adaptation could serve to reduce, or even totally eliminate, the effect of any constant internal (thermal) noise level within a network.
6 General Conclusions

The spike generation mechanism of stochastic integrate-and-fire neurons introduces an important nonlinearity into the operation of any circuit constructed of these units. In general, the consequences of this nonlinearity for the behavior of neural networks are not well understood. We have shown that the threshold nonlinearity produces a probabilistic neural adaptation that counteracts the tendency for an integrate-and-fire neuron to spike in response to noise fluctuations in its generator potential. This property of integrate-and-fire neurons might be useful in both artificial and biological neural networks, where it could function to reduce the number of false-positive responses of individual neurons within the network. We note in particular that noise adaptation could serve to reduce, or even totally eliminate, the effect of any constant internal (thermal) noise level within a network.
We envision that noise adaptation could be used to dynamically adjust the sensitivity of individual neurons within a network as either the internal noise or the input noise level changes on a time scale that is slow compared to the time scale over which modulations in the spiking rates are interpreted as meaningful signals. By increasing the average distance between the generator potential and the neural threshold as a function of the level of voltage noise, noise adaptation would tend dynamically to control the individual gains of the network neurons to compensate for both internal and external noise (Rudd & Brown, 1994, 1995, 1996, in press). Thus, the fact that integrate-and-fire neurons undergo noise adaptation also implies that they exhibit a stochastic gain control, which in a network context could function to keep the spiking rates of individual neurons operating within an optimal range, despite gradually changing neural noise levels. If the spike generation model were made somewhat more realistic by the additional assumption of an absolute refractory period, noise adaptation would also help to protect the neurons from response saturation due to noise-induced spiking activity.

We became interested in noise adaptation in integrate-and-fire neurons while carrying out simulation studies in which we examined the effect of the spike generation mechanism of retinal ganglion cells on the behavior of a retinal circuit that receives probabilistic light input (photons) (Rudd & Brown, 1993, 1994, 1995, 1996, in press). Because the number of photons falling on a small area of the retina within a small time interval obeys Poisson statistics, the mean and variance of the photon count are equal. In the context of our integrate-and-fire ganglion cell model, the noise adaptation mechanism described in this article results in a noise adaptation that depends on the level of the random fluctuations in the number of photons falling within the ganglion cell receptive field, and thus also on the mean light level. Noise adaptation in our retinal ganglion cell model therefore yields a type of light adaptation that is inherently stochastic and is thus distinct from the deterministic retinal light adaptation models that have been previously proposed. Our ganglion cell noise adaptation model accounts for a large body of light adaptation data, both physiological and psychophysical, that cannot be accounted for by deterministic light adaptation models. Furthermore, we have used the model to make predictions about the dependence of the apparent brightness of light flashes on the mean light level (Rudd & Brown, 1996) that have since been experimentally verified (Brown & Rudd, submitted).

The ubiquity of spike generation in the central nervous system suggests that similar noise adaptation may occur in other neural circuits. Because of the large number of neural subsystems to which our results might potentially be relevant, some discussion of the biological plausibility of the model is in order. Clearly, the intracellular potential of actual neurons cannot take on arbitrarily large negative values, as we have assumed here. However, it is probably not critical to assume that the generator potential can take on arbitrarily large, negative values in order to achieve significant amounts of
noise adaptation in integrate-and-fire neurons. The biological plausibility of the noise adaptation hypothesis appears to depend critically on whether reset-induced hyperpolarization in real neurons can build up to a sufficient degree to account for the levels of adaptation observed in vivo. In the context of our light adaptation work, we simulated the response of a neuron in which the maximum hyperpolarization state was limited by a hard saturation. In one such simulation, we imposed a lower bound of −15θ on the range of the generator potential without noticeably affecting the spiking response of the neuron when it was driven by gaussian white noise of mean zero. However, we have not experimented with variants of the hard saturation model in which the range of allowable hyperpolarization states is shallower, nor have we parametrically investigated the effects of varying the lower bound on the degree of noise adaptation.

In the theoretical neurobiology literature, an alternative version of the integrate-and-fire model, in which the generator potential is restricted to the range 0 ≤ V(t) ≤ θ by a reflecting barrier at zero, is often assumed. This boundary condition is in some sense the opposite of the assumption made in this article that there is no lower bound on the potential. Whether the lower bound on the potential is modeled as a reflecting barrier or a hard saturation, a biologically realistic integrate-and-fire model would allow for some range of negative potential states. Although restricting the negative potential states to a small range would not seem to allow much room for noise adaptation, it is difficult to estimate the amount of adaptation that should be expected from such a model without performing a complete probabilistic analysis, because the effect of hyperpolarization on the firing rate depends not just on the average potential level but also on the distribution of potential states. Perhaps even a small, negative potential range would allow for enough adaptation to account for a significant amount of the adaptation observed in actual neurons.

The noise adaptation properties of our integrate-and-fire model are relevant to a large body of contemporary simulation work in which networks are constructed of stochastic integrate-and-fire neurons. In many such networks, the individual neurons within the network receive a large number of combined excitatory and inhibitory inputs. By the central limit theorem, the inputs to the spike generation mechanisms of the network neurons, integrated over a sufficiently long time period, will be well approximated as gaussian random variables, and the model developed here will then apply to the individual neurons within the network. Unless lower bounds on the neural activation levels in such a network model are explicitly imposed by the modeler, the network neurons will undergo noise adaptation of the type that we have described. Caution should therefore be exercised in attributing any observed dynamical changes in the average firing rates of individual neurons within a network solely to global network effects. At the same time, we expect the coupling of neurons within such a network to encourage the noise adaptation to spread and be modified, thus producing global noise adaptation effects that cannot be easily anticipated on the basis of the analysis presented here.
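The hard saturation experiment described above is straightforward to reproduce in outline. The following sketch (our reconstruction, not the authors' code) clips the generator potential of a nonleaky model neuron at −15θ and compares early and late population spike rates; the transient rate should still exceed the adapted rate, showing that noise adaptation survives a bounded hyperpolarization range:

```python
import numpy as np

# Illustrative sketch (ours): nonleaky integrate-and-fire neuron driven by
# zero-mean gaussian white noise, with the generator potential clipped at
# the hard saturation -15*theta mentioned in the text.
theta, sigma, dt, n_paths, n_steps = 1.0, 1.0, 1e-3, 20_000, 20_000
rng = np.random.default_rng(1)
v = np.zeros(n_paths)
rate = np.empty(n_steps)
for i in range(n_steps):
    v += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
    fired = v >= theta
    rate[i] = fired.mean() / dt          # population estimate of the spike rate
    v[fired] = 0.0                       # reset
    np.maximum(v, -15 * theta, out=v)    # hard lower bound on hyperpolarization
print("early rate:", rate[:1000].mean(), "late rate:", rate[-1000:].mean())
```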
Appendix

A.1 Density Function of the Generator Potential. Our goal is to obtain the probability density function of V(t) given the boundary condition V(0). Let p(V(t)) be the density function. The integral of this density with respect to the voltage-level parameter v is the cumulative distribution function of V(t), written P{V(t) ≤ v}. We will solve for P{V(t) ≤ v}, then take the derivative with respect to v to find p(V(t)). Here we assume that α = µ(t) = 0 and σ(t) = σ, a constant. When α = 0, there is no membrane leak, and the spike generation mechanism is driven by a regular Brownian motion process. Recall that equation 2.1 describes the evolution of the generator potential between spikes. With α = 0, this equation describes the time evolution of a Brownian motion process.

The probabilistic path of the generator potential thus consists of a sequence of Brownian motion paths strung together, one after another, with discontinuous breaks between the Brownian motion paths occurring at the reset times. Each time that the neuron resets, a new Brownian motion path is initiated at voltage level zero, and that particular Brownian motion path is terminated as soon as it exceeds the neural threshold. The first Brownian motion path in the sequence is initialized at the level V(0), which, in general, may be different from zero. Although every generator potential path consists of a sequence of Brownian motion paths, the overall generator potential path will not be a Brownian motion path unless no resets of the potential occur along the path. In general, the path structure of the generator potential includes the effects of resets, whereas that of Brownian motion does not. This difference is reflected in the two different stochastic differential equations that describe the time evolution of Brownian motion (see equation 2.1) and that of the generator potential (see equation 2.3). Because the two processes are subject to different probability laws, we need to distinguish them clearly in what follows. We will use B(t) to denote a regular Brownian motion path (with no resets and no termination on reaching threshold) and reserve V(t) to signify the generator potential path, which is reset to zero whenever it exceeds θ.

Let B(t) be a Brownian motion path that is initialized (by either a reset or the initial boundary conditions) at an arbitrary level v0 at any time s within the interval [0, t]. The probability that B(t) is below level v at time t is then given by the normal integral

$$\phi(v - v_0, t - s) \equiv P\{B(t) \le v \mid B(s) = v_0\} = \frac{1}{\sigma \sqrt{2\pi (t - s)}} \int_{-\infty}^{v} e^{-(x - v_0)^2 / (2\sigma^2 (t - s))}\, dx \qquad (A.1)$$
(Karlin & Taylor, 1975). We will use this result to compute the distribution function P{V(t) ≤ v} by noting that, at any arbitrary time t, V(t) will be a point on a Brownian motion path that was initialized at the time of the last reset (or at time zero if no reset has occurred). Therefore, P{V(t) ≤ v} will be a sum of probabilities of the form in equation A.1, each weighted by the conditional probability that V(t) is on a path initialized at (v0, s).

The generator potential paths for which V(t) ≤ v can be partitioned into two subgroups. The first group consists of those paths that exceed the neural threshold θ at least once in the interval [0, t]; the second group consists of those that do not. We may therefore write P{V(t) ≤ v} as the sum of two joint probabilities:

$$P\{V(t) \le v\} = P\{V(t) \le v,\ \text{no resets}\} + P\{V(t) \le v,\ \text{at least one reset}\}. \qquad (A.2)$$
Define the maximum height during the interval [s, t] of a Brownian motion path initialized at (v0, s) as

$$M(s, t) \equiv \max_{s \le u \le t}\, \{B(u) \mid B(s) = v_0\}.$$
The first term on the right side of equation A.2 can then be rewritten in terms of the Brownian motion process according to the law of total probability:

$$P\{V(t) \le v,\ \text{no resets}\} = P\{M(0, t) < \theta,\ B(t) \le v \mid B(0) = V(0)\}$$
$$= P\{B(t) \le v \mid B(0) = V(0)\} - P\{M(0, t) \ge \theta,\ B(t) \le v \mid B(0) = V(0)\}. \qquad (A.3)$$

By applying the well-known reflection principle (Karlin & Taylor, 1975, p. 345), it can be shown that

$$P\{M(0, t) \ge \theta,\ B(t) \le v \mid B(0) = V(0)\} = P\{B(t) \ge 2\theta - v \mid B(0) = V(0)\} = 1 - \phi(2\theta - v - V(0), t), \qquad (A.4)$$
the last step in equation A.4 being an application of equation A.1. Substituting equations A.1 (with v0 = V(0) and s = 0) and A.4 directly into equation A.3 then yields

$$P\{V(t) \le v,\ \text{no resets}\} = \phi(v - V(0), t) + \phi(2\theta - v - V(0), t) - 1. \qquad (A.5)$$

We next derive an expression for the second probability on the right-hand side of equation A.2. Note that for every generator potential path that contributes to this probability, there must be one and only one time in the interval [0, t] at which the potential was last reset. We can therefore assign a probability density to each possible last reset time. Let s ∈ [0, t] be the
time of the last reset. This is equivalent to saying that a reset occurs at s and no further resets occur in the interval [s, t]. We can therefore write the joint probability for the event {last reset at s, V(t) ≤ v} in the form P{V(t) ≤ v, no resets ∈ [s, t] | reset at s}. The probability that we seek will then be an integral over s of terms of this form, each weighted by the average number of resets dN(s) that occur within a small neighborhood of s. But

$$dN(s) = \frac{dN(s)}{ds}\, ds,$$
and, for the particular case of the nonleaky integrate-and-fire neuron considered here, dN(s)/ds can be written in terms of the infinite series in equation 3.4. We can therefore write the second probability on the right-hand side of equation A.2 in the form

$$P\{V(t) \le v,\ \text{at least one reset}\} = \int_0^t \sum_{N=1}^{\infty} f_N(s)\, P\{V(t) \le v,\ \text{no resets} \in [s, t] \mid \text{reset at } s\}\, ds$$
$$= \int_0^t \sum_{N=1}^{\infty} f_N(s)\, P\{V(t) \le v,\ \text{no resets} \in [s, t] \mid V(s) = 0\}\, ds. \qquad (A.6)$$
The last step in equation A.6 follows from the Markov property of Brownian motion: the probability of any given path over the interval [s, t] depends only on the initial boundary condition V(s) = 0 and not on how the generator potential got there. Note that paths that "accidentally" satisfy the condition of the final conditional probability (i.e., when no reset actually occurs at s) will not be given any weight by the infinite series weighting term, because all of the weight corresponds to paths that reset at s; so paths that accidentally cross the zero line at s will effectively be given a zero weighting. Furthermore, although there is no limit to the number of resets that can occur in the interval [0, t], all resets other than last resets are automatically discounted by the nature of the joint probability within the integrand of equation A.6. Since every path that leads to V(t) will have one and only one last reset time, each path will be counted once and only once, ensuring that the probability in equation A.6 will be properly normalized.

Continuing with our derivation of the generator potential density function, we next note that the following two expressions are equivalent:

$$P\{V(t) \le v,\ \text{no resets} \in [s, t] \mid V(s) = 0\} = P\{V(t - s) \le v,\ \text{no resets} \in [0, t - s] \mid V(0) = 0\}, \qquad (A.7)$$
and that the probability on the right-hand side of equation A.7 is obtained simply by substituting 0 for V(0) and t − s for t in equation A.5, to get
$$P\{V(t - s) \le v,\ \text{no resets} \in [0, t - s] \mid V(0) = 0\} = \phi(v, t - s) + \phi(2\theta - v, t - s) - 1. \qquad (A.8)$$
Combining equations A.6 through A.8 then yields

$$P\{V(t) \le v,\ \text{at least one reset}\} = \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \bigl(\phi(v, t - s) + \phi(2\theta - v, t - s) - 1\bigr)\, ds. \qquad (A.9)$$
Finally, we substitute equations A.5 and A.9 into equation A.2 to obtain

$$P\{V(t) \le v\} = \phi(v - V(0), t) + \phi(2\theta - v - V(0), t) - 1 + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \bigl(\phi(v, t - s) + \phi(2\theta - v, t - s) - 1\bigr)\, ds. \qquad (A.10)$$
The density function of the generator potential is the derivative with respect to v of equation A.10:

$$p[V(t)] = \frac{1}{\sigma\sqrt{2\pi t}} \left( e^{-(v - V(0))^2/(2\sigma^2 t)} - e^{-(2\theta - v - V(0))^2/(2\sigma^2 t)} \right) + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \frac{1}{\sigma\sqrt{2\pi (t - s)}} \left( e^{-v^2/(2\sigma^2 (t - s))} - e^{-(2\theta - v)^2/(2\sigma^2 (t - s))} \right) ds. \qquad (A.11)$$

Because the density function does not have a simple analytic form, its dependence on the parameters of the model can be understood most easily by numerically evaluating the function and plotting it for different parameter values. Since our overarching goal is to understand the mechanism of noise adaptation in the integrate-and-fire neuron, we will restrict our further analysis of equation A.11 to the task of finding its dependence on the root-mean-square noise magnitude σ and the adaptation time t. Note that wherever these two variables appear within the expression, they always appear together in the product σ²t, which is the time-integrated variance of the Brownian motion process that drives the neuron. It is therefore unnecessary to examine the dependence of the function on the two variables separately. In Figure 4, we plot equation A.11 for three different levels of the noise standard deviation σ√t (with V(0) = 0 and θ = 50). The plots indicate that the spread in the generator potential distribution increases with the standard deviation of the Brownian motion process that drives the neuron. Because the range of V(t) is bounded on the upper end by θ, large variability in V(t) is associated with a high probability of obtaining large, negative voltage values.
Figure 4: Examples of the probability density function (pdf) of the generator potential of a nonleaky integrate-and-fire neuron driven by gaussian white noise with mean zero and constant variance. The three pdfs correspond to three different standard deviations of the time-integrated noise in the Brownian motion process that drives the neuron (indicated on plot). Additional model parameters: θ = 50; α = 0; µ = 0; V(0) = 0.
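The qualitative behavior shown in Figure 4 is easy to reproduce by Monte Carlo simulation. The sketch below is ours; the noise levels are only representative values, since the levels used in Figure 4 are indicated on the plot rather than in the text. The empirical histograms can be compared against the analytic density in equation A.11:

```python
import numpy as np

# Monte Carlo sketch (ours): empirical distribution of V(t) for a nonleaky
# neuron (alpha = 0) driven by zero-mean gaussian white noise, for several
# hypothetical values of sigma*sqrt(t). Spread should grow with sigma*sqrt(t),
# with mass piling up at increasingly negative voltages below theta.
theta, t, dt, n_paths = 50.0, 1.0, 1e-3, 100_000
rng = np.random.default_rng(2)
for sigma_sqrt_t in (20.0, 50.0, 100.0):   # hypothetical noise levels
    sigma = sigma_sqrt_t / np.sqrt(t)
    v = np.zeros(n_paths)
    for _ in range(int(t / dt)):
        v += sigma * np.sqrt(dt) * rng.standard_normal(n_paths)
        v[v >= theta] = 0.0                # reset at threshold
    hist, edges = np.histogram(v, bins=200, density=True)  # cf. Figure 4
    print(sigma_sqrt_t, float(v.mean()), float(v.std()))
```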
A.2 The First and Second Moments of the Generator Potential Distribution. The first and second moments of the generator potential density (see equation A.11) can be calculated in a straightforward manner using integral calculus. The first moment is

$$\overline{V(t)}_{\mu=0} = V(0) - \theta \left[ \mathrm{Erfc}\!\left( \frac{\theta - V(0)}{\sigma\sqrt{2t}} \right) + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \mathrm{Erfc}\!\left( \frac{\theta}{\sigma\sqrt{2(t - s)}} \right) ds \right], \qquad (A.12)$$

where Erfc denotes the complementary error function, Erfc = 1 − Erf, and the error function Erf is defined in the usual way:

$$\mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-z^2}\, dz$$

(Gradshteyn & Ryzhik, 1980, p. 930). Recall that we already have an expression for the mean of V(t) (see equation 4.7), which was derived on the basis of the stochastic differential equation approach. Setting µ = 0 in that expression should produce an expression that is equivalent to equation A.12. The two expressions are equal if the following general relation holds:

$$\int_0^t R(s)\, \mathrm{Erf}\!\left( \frac{k}{\sqrt{t - s}} \right) ds = \mathrm{Erfc}\!\left( \frac{k}{\sqrt{t}} \right), \qquad (A.13)$$

where we have defined

$$R(s) = \sum_{N=1}^{\infty} \frac{Nk}{\sqrt{\pi s^3}}\, e^{-N^2 k^2 / s}.$$

We have checked equation A.13 numerically using Gauss-Legendre 100-point quadrature over several orders of magnitude of t and k and have found excellent agreement; however, we have not been able to prove the equality.
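A numerical check of the kind described above can be written compactly; the following sketch (ours, not the authors' code) evaluates both sides of equation A.13 with 100-point Gauss-Legendre quadrature, truncating the series for R(s) at a finite number of terms (harmless, because of the exp(−N²k²/s) factor):

```python
import numpy as np
from scipy.special import erf, erfc

def R(s, k, n_terms=200):
    # Truncated series for R(s); converges rapidly in N for s in (0, t).
    n = np.arange(1, n_terms + 1)[:, None]
    return np.sum(n * k / np.sqrt(np.pi * s**3) * np.exp(-n**2 * k**2 / s), axis=0)

def check_a13(t, k, n_nodes=100):
    x, w = np.polynomial.legendre.leggauss(n_nodes)  # nodes/weights on [-1, 1]
    s = 0.5 * t * (x + 1.0)                          # map nodes to (0, t)
    lhs = 0.5 * t * np.sum(w * R(s, k) * erf(k / np.sqrt(t - s)))
    rhs = erfc(k / np.sqrt(t))
    return lhs, rhs

for t, k in [(1.0, 0.5), (10.0, 1.0), (100.0, 2.0)]:
    print((t, k), check_a13(t, k))   # the two values should agree closely
```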
To write the second moment, we first define

$$\psi(t, v_0) \equiv 2\theta v_0 - 2\theta^2 + \sigma\sqrt{\frac{2t}{\pi}}\, e^{-(\theta - v_0)^2/(2\sigma^2 t)}\, (\theta + v_0) + \mathrm{Erf}\!\left( \frac{\theta - v_0}{\sigma\sqrt{2t}} \right) \left( v_0^2 + \sigma^2 t + 2\theta^2 - 2\theta v_0 \right), \qquad (A.14)$$

then write

$$\overline{V^2(t)}_{\mu=0} = \psi(t, V(0)) + \int_0^t \sum_{N=1}^{\infty} f_N(s)\, \psi(t - s, 0)\, ds. \qquad (A.15)$$
References

Brown, L. G., & Rudd, M. E. Evidence for a noise-gain control mechanism in human vision. Manuscript submitted for publication.
Calvin, W. H., & Stevens, C. F. (1965). A Markov process for neuron behavior in the interspike interval. In Proceedings of the 18th Annual Conference on Engineering in Medicine and Biology (Vol. 7, p. 118).
Capocelli, R. M., & Ricciardi, L. M. (1971). Diffusion approximation and first passage time problem for a model neuron. Kybernetik 8:214–233.
Chapeau-Blondeau, F., Godivier, X., & Chambet, N. (1996). Stochastic resonance in a neuron model that transmits spike trains. Physical Review E 53:1273–1275.
Geisler, C. D., & Goldberg, J. M. (1966). A stochastic model of the repetitive activity of neurons. Biophysical Journal 6:53–69.
Gerstein, G. L. (1962). Mathematical models for the all-or-none activity of some neurons. Institute of Radio Engineers Transactions on Information Theory 8:137–143.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal 4:41–68.
Gluss, B. (1967). A model for neuron firing with exponential decay of potential resulting in diffusion equations for probability density. Bulletin of Mathematical Biophysics 29:233–243.
Gradshteyn, I. S., & Ryzhik, I. M. (1980). Table of integrals, series, and products. New York: Academic Press.
Holden, A. V. (1976). Models of the stochastic activity of neurons. Berlin: Springer-Verlag.
Johannesma, P. I. M. (1968). Diffusion models for the stochastic activity of neurons. In E. R. Caianiello (Ed.), Neural networks. Berlin: Springer-Verlag.
Karlin, S., & Taylor, H. M. (1975). A first course in stochastic processes. New York: Academic Press.
Parzen, E. (1962). Stochastic processes. San Francisco: Holden-Day.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process: A model for neuronal activity. Biological Cybernetics 35:1–9.
Roy, B. K., & Smith, D. R. (1969). Analysis of the exponential decay model of the neuron showing frequency threshold effects. Bulletin of Mathematical Biophysics 31:341–357.
Rudd, M. E. (1996). A neural timing model of visual threshold. Journal of Mathematical Psychology 40:1–29.
Rudd, M. E., & Brown, L. G. (1993). A stochastic neural model of light adaptation. Investigative Ophthalmology and Visual Science 34:1036.
Rudd, M. E., & Brown, L. G. (1994). Light adaptation without gain control: A stochastic neural model. Investigative Ophthalmology and Visual Science 35:1837.
Rudd, M. E., & Brown, L. G. (1995). Temporal sensitivity and threshold vs. intensity behavior of a photon noise-modulated gain control model. Investigative Ophthalmology and Visual Science 35:S16.
Rudd, M. E., & Brown, L. G. (1996). Stochastic retinal mechanisms of light adaptation and gain control. Spatial Vision 10:125–148.
Rudd, M. E., & Brown, L. G. (In press). A model of Weber and noise gain control in the retina of the toad Bufo marinus. Vision Research.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophysical Journal 5:173–194.
Stein, R. B. (1967). Some models of neuronal variability. Biophysical Journal 7:37–68.
Thomas, M. U. (1975). Some mean first passage time approximations for the Ornstein-Uhlenbeck process. Journal of Applied Probability 12:600–604.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: SIAM.
Uhlenbeck, G. E., & Ornstein, L. S. (1930). On the theory of Brownian motion. Physical Review 36:823–841.
Wan, F. Y. M., & Tuckwell, H. C. (1982). Neuronal firing and input variability. Journal of Theoretical Neurobiology 1:197–218.
Received January 16, 1996; accepted October 21, 1996.
Communicated by Mikhail Tsodyks
Paradigmatic Working Memory (Attractor) Cell in IT Cortex

Daniel J. Amit
Stefano Fusi
Racah Institute of Physics, Hebrew University, Jerusalem, and INFN, Istituto di Fisica, Università di Roma, Italy

Volodya Yakovlev
Department of Neurobiology, Hebrew University, Jerusalem, Israel
We discuss paradigmatic properties of the activity of single cells comprising an attractor—a developed stable delay activity distribution. To demonstrate these properties and a methodology for measuring their values, we present a detailed account of the spike activity recorded from a single cell in the inferotemporal cortex of a monkey performing a delayed match-to-sample (DMS) task of visual images. In particular, we discuss and exemplify (1) the relation between spontaneous activity and activity immediately preceding the first stimulus in each trial during a series of DMS trials, (2) the effect on the visual response (i.e., activity during stimulation) of stimulus degradation (moving in the space of IT afferents), (3) the behavior of the delay activity (i.e., activity following visual stimulation) under stimulus degradation (attractor dynamics and the basin of attraction), and (4) the propagation of information between trials—the vehicle for the formation of (contextual) correlations by learning a fixed stimulus sequence (Miyashita, 1988). In the process of the discussion and demonstration, we expose effective tools for the identification and characterization of attractor dynamics.1

1 A color version of this article is found on the Web at: http://www.fiz.huji.ac.il/staff/acc/faculty/damita

1 Introduction

1.1 Delay Activity in Experiment. A number of cortical areas, such as the inferotemporal cortex (IT) and prefrontal cortex, have been suggested as part of the working memory system (Fuster, 1995; Miyashita & Chang, 1988; Nakamura & Kubota, 1995; Wilson, Scalaidhe, & Goldman-Rakic, 1993). The phenomenon is detected in primates trained to perform a delayed match-to-sample (DMS) task or a delayed eye-movement (DEM) task, using, in each case, a relatively large set of stimuli. It is observed in single-unit extracellular recordings of spikes following, rather than during, the presentation of a
sensory stimulus. In these tasks, the behaving monkey must remember the identity or location of an initial eliciting stimulus in order to decide on its behavioral response following a second stimulus. This latter test stimulus is, with equal likelihood, identical to or different from the first stimulus in the DMS task and is simply a "go" signal in the DEM task. In these areas of the cortex, neurons are observed, in rather compact regions of the same area (called modules or columns), to have reproducible elevated spike rates during the delay period, after the first, eliciting stimulus has been removed.2 Such elevated rate distributions have been observed to persist for long times (compared to neural time constants)—as long as 30 seconds. The rates observed are in the range of about 10–20 spikes per second (s⁻¹), against a background of spontaneous activity of a few s⁻¹. The subset of neurons that sustain elevated rates, in the absence of any stimulus, is selective of the preceding, first, or sample stimulus. Thus, the distribution of delay activity could act as a neural representation or memory of the identity of the eliciting stimulus, transmitted for later processing in the absence of the stimulus.

2 A similar phenomenon has been observed in the motor cortex by Georgopoulos (private communication).

Moreover, in one experiment in which training was carried out with eliciting stimuli presented in a fixed order, it was observed that though the stimuli were originally generated to be uncorrelated, the resulting learned delay activity distributions (DADs), corresponding to the uncorrelated stimuli, were themselves correlated (Miyashita & Chang, 1988). That is, there was an elevated probability that the same neuron would respond with an elevated delay activity for two stimuli that were close in the sequence. In fact, the magnitude of the correlations expressed the temporal separation of the stimuli in the training sequence: the closer two stimuli were in the training sequence, the higher the correlation of the corresponding two DADs. (For a detailed discussion, see Amit, 1994, 1995; and Amit, Brunel, & Tsodyks, 1994.)

1.2 The Attractor Picture. These experimental findings have been interpreted as an expression of local attractor dynamics in the cortical module: a comprehensive picture has been suggested (Amit, 1994, 1995; Amit et al., 1994) that connects the DAD to the recall of a memory into a working status and views the DAD as a collective phenomenon sustained by a synaptic matrix formed in the process of learning. The collective aspect is expressed in the mechanism by which a stimulus that had been learned previously has left a synaptic engram of potentiated excitatory synapses connecting the cells driven by the stimulus. When this subset of cells is reactivated, they cooperate to maintain elevated firing rates in the selective set of neurons, via the same set of potentiated synapses, after the stimulus is removed. In this way, these cells can provide each other, that is, each one of the group, with an afferent signal that clearly differentiates the members of the group from
other cells in the same cortical module. Theoretical and simulation studies have demonstrated that this signal may be clear enough to overcome the noise generated by the spontaneous activity of all other cells, provided that not too many stimuli have been encoded into the synaptic system of the module. A large number of "correct" neurons must collaborate to sustain the pattern. A rough estimate is that about 1000–2000 cells would collaborate in a given DAD (Brunel, 1994), out of some 100,000 cells in a column of 1 mm². There would be considerable overlap between DADs.

The collective nature of the DAD is related to its attractor property. Since so many cells collaborate, the DAD is robust to stimulus "errors." That is, even if a few of the cells belonging to the self-maintaining group are absent from the group initially driven by a particular test stimulus, or if some of them are driven at the "wrong" firing rates, once the stimulus is removed, the synaptic structure will reconstruct the "nearest" distribution of elevated activities of those it had learned to sustain. The reconstruction (the attraction) will succeed provided the deviation from the learned template is not too large. All stimuli leading to the same DAD drive activities in the basin of attraction of that attractor. Each of the stimuli presented repeatedly during training creates an attractor with its own basin of attraction (see Amit & Brunel, 1995). In addition, the same module can have all neurons at spontaneous activity levels, if the stimulus has driven too few of the neurons belonging to any of the stimuli previously learned.

1.3 Attractors and Learning Correlations. According to the attractor picture, attractors are established through the learning process. In experiments in which monkeys performed a DMS task using as many as 100 stochastically generated visual stimuli, selective delay activities were formed for all images repeatedly used in the training phase. This finding supports a dynamic interpretation for the delay activity and confirms the presence of a learning process, since it is rather unlikely that synaptic structures related to these stimuli had been generated prior to training. In fact, no associated delay activities were found for new images—not used during training—but introduced subsequently in the testing stage (Miyashita & Chang, 1988). Moreover, the correlations between DADs for sequentially appearing stimuli in the fixed training order, corresponding to uncorrelated images, are even more directly related to learning, inasmuch as these correlations are dependent on the particular fixed order in which the stimuli appeared in the particular training protocol.

Learning these correlations presents a puzzle that is naturally resolved in the attractor picture: How does the information contained in one image (used as the first, eliciting, stimulus in one trial) propagate, in the absence of the image, until the next image in the sequence is presented, several seconds later, as the first stimulus in the next trial? Attractors provide a natural vehicle for the propagation of this information (Griniasty, Tsodyks, & Amit, 1993; Amit et al., 1994; Amit, 1995), as follows: During training, there are initially no delay activity attractors. If
the stimuli in the training set are uncorrelated, as DADs are first formed they are necessarily uncorrelated because there is no way to communicate information between successive stimuli. Once the uncorrelated DADs have been formed, however, these DADs themselves may be used to propagate information between trials. Recall that in half of the trials, the second stimulus is identical to the first (so as to maintain an unbiased probability of same and different trials). This second stimulus may also leave behind a delay activity distribution, during the intertrial interval, identical to the DAD excited by the first stimulus, during the interstimulus interval of the trial. If this delay activity is not disturbed by the motor response of the animal (and other unmonitored visual or nonvisual events in the intertrial interval), then it may persist through the intertrial interval and will be present in the network when the next image in the sequence is presented as the first stimulus in the successive trial. This delay activity will be present in the sense that there will be a time window in which the cells of the delay activity are still at elevated rates, while the cells corresponding to the new stimulus begin to be driven. Information is then available for learning the correlation of the two (Brunel, 1996).
1.4 Experimental Questions. In this study, we outline an experimental program. We show recordings, and results of their analysis, from a paradigmatic cell—an IT neuron with a clear delay activity, which we were successful in holding for a long, stable recording session, sufficient for a detailed series of tests. Given the very detailed recordings of selective delay activities reported by Miyashita and Chang (1988); Miyashita (1988); Sakai and Miyashita (1991); Wilson et al. (1993); and Nakamura and Kubota (1995), on the one hand, and the detailed picture of attractor networks (Amit, 1994, 1995; Amit et al., 1994) on the other, one naturally asks what would be expected of a cell that participates in an attractor net. We ask the question in this article, and describe in detail the expected answer by demonstrating the results from this particularly rich paradigmatic neuron. Our goal is not to present data that give a definitive answer to this question of whether IT delay activity reflects an attractor network serving stimulus image memory. For that we would have to record and analyze many cells from more than one monkey. We wish instead to raise the issues quantitatively and point out, by considering in detail data from a single example, what experiments need to be carried out and what behavior one must expect from neurons participating in an attractor neural net. The central issue with which we deal is neuronal behavior following presentation of a degraded stimulus. This is because the attractor picture naturally implies basins of attraction for every attractor. That is, it predicts that although the input to IT will be different for stimuli of different degrees of degradation, the attractor, that is, the delay activity distribution within IT, will always be the same, as long as the degraded stimulus is not too far from
the original. Thus, the attractor theory predicts very different characteristics for the response of IT neurons during stimulation (reflecting the response driven by the input to IT) and the response following stimulation (reflecting the attractor state).

Testing the attractor picture empirically raises a set of issues:

1. IT observability. Are the stimuli used in these experiments identifiable at the level of the relevant cortical region, namely, IT? Do these stimuli induce neuronal responses in the recorded area that are clearly higher than spontaneous, as well as higher than the delay activities? Are these responses different for the different stimuli?

2. Significance of prestimulus activity. Interstimulus interval (ISI) activity is identified in relation to activity immediately preceding the first stimulus of a trial. Is the latter merely spontaneous activity, or could it be delay activity following the second stimulus in the preceding trial? The question is particularly pertinent when trials are organized in series.

3. Moving in the space of IT-observable stimuli. What is the effect of stimulus degradation, for example, using degraded or noisy stimuli? What is the effect on visual responses (during the stimulus) and on attractor dynamics, as expressed in a given cell?

4. Basin of attraction. Does motion in IT-observable space bring about an abrupt change in delay activity?

5. Information transport between consecutive trials. Can delay activity act as the agent for the generation of correlations between sequentially appearing stimuli, as in Griniasty et al. (1993) and Amit et al. (1994)?

2 Methods

2.1 Preparation, Stimuli, and Task. A rhesus monkey (Macaca mulatta) was trained to perform a DMS task on 30 stochastically generated images, 15 fractals and 15 Fourier descriptors; much as in Miyashita (1988) and Miyashita, Higuchi, Sakai, and Masui (1991), the two types were randomly intermixed. Three examples of the stimuli are shown in Figure 1: from left to right, two fractals and a Fourier descriptor. The left two images, the fractals, are the ones we used in recording most of the data reported here.

The behavioral paradigm was as follows. Following the appearance of a flickering fixation point on the monitor screen in front of the monkey, the monkey was to lower a lever and keep it in the down position for the duration of the trial (see Figure 2). After the lever was lowered and following
Figure 1: Three of the 30 color images used in the DMS experiments. (Left) Fractal stimulus, which induced the clearest delay activity for the neuron of Figure 3 (number 5 in the figure) and is therefore called the best stimulus for this neuron. (Center) Fractal stimulus that induced delay activity indistinguishable from the average prestimulus activity, as demonstrated in Figure 3 (number 17 in the figure), and is therefore called the worst stimulus for this neuron. (Right) A Fourier descriptor type stimulus, which induced response 14 in Figure 3.
a 1000 ms delay, the first (eliciting, or sample) stimulus was presented on the screen for 500 ms. The ISI was 5 seconds. Following this delay, the second (test) stimulus was shown, also for 500 ms. This second test stimulus was the same as the first sample stimulus in half of the trials and different in the other half (chosen randomly from the other 14 stimuli of its type, fractal or Fourier descriptor). After the second stimulus was turned off, the monkey had to shift the bar left or right, depending on whether the second stimulus was the same as the first or different, respectively, and then, when the fixation point stopped flickering, to release the bar. Correct responses were rewarded by a squirt of fruit juice (see Figure 2). The monkey had been trained to a performance level of not less than 80 percent correct responses when the reported experiment took place.

2.2 Recording. We recorded spike activity extracellularly from single neurons of the inferotemporal cortex. The experiment is still in progress, so we cannot yet give the histological verification of the recorded cell. However, the coordinates of the position of the guide tube were A14 L15 during vertical penetration into the ventral cortex. The experimental procedures and care of the monkey conformed to guidelines established by the National Institutes of Health for the care and use of laboratory animals.

After a cell was isolated by a six-channel spike sorter (Multi-Spike Detector; ALPHA OMEGA Engineering, Nazareth, Israel), the experiment proceeded in two stages. During the first stage, all 30 stimuli were presented one to three times. Average PST (peri-stimulus) response histograms were produced online, as demonstrated in Figure 3. From these, the best and worst stimuli were determined by eye.
Figure 2: Schematic sequence of events in each phase of the DMS task.
The best stimulus was that which evoked the clearest excess of the average rate in the ISI over the average rate in the period immediately preceding the first stimulus (the prestimulus rate): stimulus 5 in Figure 3, corresponding to Figure 1, left. The worst stimulus was selected as one that produced an overall weak response and an indistinct difference between the prestimulus and delay-period firing rates: stimulus 17 in Figure 3, corresponding to Figure 1, center. At this stage, the prestimulus firing rate was interpreted naively as the average activity in a 1.0 s interval prior to the presentation of the trial's first stimulus.

For the second stage of the experiment, a new image set was created from the chosen best and worst stimuli. The red-green-blue (RGB) pixel map of each image was degraded by superimposing uniform noise at one of four levels. That is, we generated from each of these two images a set of five images: the pure, original one (referred to as degradation 0) and four degraded images, each at an increasing level of degradation (levels 1–4). The DMS task was then resumed, using as the first (sample) stimulus a randomly selected image from the new set of 10 (original and degraded) images. The second, test stimulus was randomly selected as an undegraded version of either the best or the worst stimulus. In Figure 4, we reproduce the best stimulus and degraded versions of this image at each of the four degradation levels used in this second stage of the experiment.

2.3 Mean Rate Correlation Coefficient. In order to test various hypotheses concerning the structure and origin of persistent activities, we introduce a coefficient measuring the correlation of the trial-by-trial fluctuations of the mean firing rates—or spike counts—in two different intervals. The mean rate correlation (MRC) is defined as follows: xn and yn (n = 1, . . . , Nst, the total number of selected trials) are, respectively, the spike counts in two
Figure 3: Spike rate histograms (full scale = 50 s⁻¹) for a small number of trials of the entire set of 30 color stimuli, used online for the selection of the best and worst stimuli. Each window is divided into six temporal intervals; the neuron's activity is binned and averaged over the different trials with the same stimulus. From left: prestimulus interval (66 ms per bin for the 1000 ms just prior to the first stimulus of the trial); first, or sample, stimulus period (33 ms per bin for 500 ms); delay (ISI) interval (125 ms per bin for 5 s); second, or test, stimulus period (showing the response for the half of the trials in which the second stimulus was the same as the first; 33 ms per bin for 500 ms); post–second stimulus interval (showing the activity when the second stimulus was the same as the first; 66 ms per bin for 1000 ms). Finally, we show the response to the second stimulus of other trials (in which the first stimulus was not that of this window but the second stimulus was, so that the trial was a nonmatch trial; 33 ms per bin for 500 ms). The horizontal line over each interval is the average rate in the interval. On the right of every window are the stimulus number; the number of trials with this image as first stimulus (f); and the number of trials with the same stimulus as second stimulus in the cases of same (s) and different (d) trials. Best is stimulus 5, the one evoking the highest excess of the average rate in the ISI over the average prestimulus rate; worst is stimulus 17, the one having an overall weak response. Vertical ticks are 10 spikes/s.
Figure 4: Image degradation (in color) as gradual motion in stimulus space for testing the attractor dynamics of delay activity. From left to right: pure best image (0 noise), then four levels of noise on RGB map.
intervals in the nth trial. Then

$$\mathrm{MRC} = \frac{\langle (x - \langle x \rangle)(y - \langle y \rangle) \rangle}{\sqrt{(\langle x^2 \rangle - \langle x \rangle^2)(\langle y^2 \rangle - \langle y \rangle^2)}}, \qquad (2.1)$$

where ⟨· · ·⟩ denotes the estimate of the expectation of a random variable performed over all the selected trials, that is,

$$\langle x \rangle = \frac{1}{N_{st}} \sum_{i=1}^{N_{st}} x_i.$$
This coefficient measures the extent to which the deviations from the mean of the spike counts in the two intervals are correlated over the selected set of trials. With this tool, we monitor whether high rates during visual stimulus presentation imply high rates in the delay period and low imply low; whether high (low) counts at the beginning of a delay period imply high (low) counts in later intervals in the delay; and whether high (low) counts following a given second stimulus are related to high (low) counts in the period immediately preceding the first stimulus of the succeeding trial, its prestimulus interval. The value of the MRC is in the interval [−1, 1]. The standard of comparison for the magnitude of this coefficient will be established by considering subclasses of trials in which the average rates in the two intervals are both high or both low. For instance, the MRC of the subset of responses to the worst stimulus is calculated by performing the expectation of equation 2.1 over the trials in which the worst stimulus was presented as a sample. In this case, if the fluctuations about the mean spike count are random and hence uncorrelated, the resulting MRC values set the scale for lack of correlation. As we shall see, large values of MRC can capture two different phenomena: (1) trial-by-trial correlation of the fluctuation of the individual counts about their mean and (2) the existence of subsets of trials with very different mean counts in each, while within each subset, fluctuations about the mean may even be uncorrelated.
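For concreteness, a minimal implementation of equation 2.1 follows (our sketch), together with a synthetic illustration of the second phenomenon: two subsets of trials with different mean counts, uncorrelated within each subset, still produce a large pooled MRC (the counts below are hypothetical):

```python
import numpy as np

def mrc(x, y):
    """Mean rate correlation (equation 2.1): Pearson correlation of the
    trial-by-trial spike counts x and y in two intervals."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

rng = np.random.default_rng(3)
# Two subpopulations of trials with different mean counts, independent draws
# within each subset (36 "elevated" trials, 38 "low" trials).
x = np.concatenate([rng.poisson(15, 36), rng.poisson(6, 38)])
y = np.concatenate([rng.poisson(14, 36), rng.poisson(5, 38)])
print(mrc(x, y))                                  # large, driven by the subset means
print(mrc(x[:36], y[:36]), mrc(x[36:], y[36:]))   # near zero within each subset
```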
3 Experimental Testing and the Special Cell

The issues raised in section 1.4 will now be discussed in the same order in terms of the responses of a single IT cell. The intention is not to claim that the data presented here give a definite answer to these questions. That would require the recording and analysis of many cells, traditionally in more than one monkey. The objective, rather, is to raise the questions, the problems, and the potential answers by considering in detail the record of a particularly rich cell that exhibits behavior in relation to which we may pose most of the problems and demonstrate the form answers may take. Following continued recording from numerous neurons in this and another monkey, we will be able to report whether this area of IT may in fact serve this memory function for DMS tasks. Alternatively, we may report that too few IT neurons have the expected significant DAD activity to be able to serve this function under the conditions used in our experiment, and it would be best to look for such DADs elsewhere. Finally, we may find that delay activity is present in IT but does not generally have the properties we expect (and which are reported here for this single cell), so that our understanding of cortical dynamics should best be revised.

3.1 Identification of the Stimulus. Delay activity distributions are likely to be sustained by the synaptic structure, learned or preexisting. The fact that the delay activities are selective to the preceding stimulus implies that the synaptic structure can sustain a variety of distributions in the absence of a stimulus. Which of the stable delay activities is actually propagating depends on which was the last effective stimulus. But different stimuli presented on the monitor screen may not actually lead to different neuronal responses in deep structures (higher cortical areas) such as the IT cortex, since the differences may well be filtered out at afferent levels. It is the representation of the stimulus as it affects the cells of IT that is relevant for which of the DADs is excited or learned. Thus, for example, it may be that differences of scale or color in the external image may result in very similar distributions of afferents when arriving at the IT cortex. In that case, they cannot elicit different DADs. In other words, stimuli and stimulus differences must be IT observable. (For some recent work on observability of stimulus variation in IT, see Kovacs, Vogels, & Orban, 1995.)

What can give rise to different DADs are significantly different distributions of synaptic inputs that arrive from previous cortical areas to IT. These distributions, in turn, can be observed by recording the neuronal activity in the attractor network during presentation of the stimulus. In fact, the IT cortex is particularly propitious in this respect because rates observed during the presence of the stimulus in visually responsive cells that participate in a DAD are much higher than those observed for the same cells during interstimulus intervals of the trial. (See, e.g., Miyashita & Chang (1988), and the responses in the underlined intervals in the histograms for the best
stimulus in Figure 5.) This allows a very clear identification of the stimulus, for comparison with the attractor. In other cortical areas, such as the prefrontal cortex, this may not be as clear (see, e.g., Wilson et al., 1993; and Williams & Goldman-Rakic, 1995).

The range of responses during the stimulus and during the ISI, as in Figures 3 and 5, exhibits IT observability of the stimuli as well as of their differences. Thus, the stimuli used for this experiment, the 30 fractals and Fourier descriptors, are IT observable in that the responses to some of them are much higher than the background spontaneous rates, as well as higher than the rates during the delay period (ISI). This difference from background will be measured quantitatively below. In addition, it is clear that our neuron also differentiates between stimuli (e.g., between the best and the worst stimuli) during and following the stimulation.
3.2 Delay Activity and Spontaneous Activity. It is quite common to consider the ongoing activity before the first stimulus (the prestimulus activity) in a given cell as spontaneous activity. Delay activity is considered significant if its rate is significantly higher than the prestimulus rate. In this way, one observes in Figure 5, in the first three windows on the left of the best stimulus row, that the average rate (the horizontal line) during the delay (the last, wide interval) is higher than in the prestimulus interval of the current trial (see the numbers in the table below the histograms). The two rates become equal in the two windows on the right (see section 4). In the leftmost window of the best row of histograms, corresponding to the undegraded best stimulus, the ratio of ISI to prestimulus rates is 13/8, which may not be considered very convincing as a signal-to-noise ratio.

But in a context in which delay activities exist, prestimulus rates are not necessarily spontaneous. For example, if the second stimulus in a trial evokes a DAD when presented as a first stimulus, and if the neuron being observed has an elevated rate in this DAD, then one would expect that following the second stimulus, this neuron would have an elevated rate. In fact, this is observed in Figure 6, where we plot histograms of spike rates that follow presentation and removal of the best (top) and the worst (bottom) stimuli as second stimulus. These histograms are averages over all trials at all levels of first stimulus degradation, since the second stimulus is always undegraded. What one observes is that following the best second stimulus, the level of the post–second-stimulus persistent rate is as high as the delay activity for the first three degradation levels (0–2) in Figure 5. This is true even when the first stimulus, due to its degradation, does not provoke significant delay activity. It is consistent with the fact that this persistent activity is provoked by the second stimulus, which is never degraded.

Another expression of the same fact is presented in Figure 7. These are the trial-by-trial distributions of the post–second stimulus average rate for best
Figure 5: Testing attractor dynamics: spike rate histograms (rate scale 10 s⁻¹, bins 50 ms) for the best (top row) and worst (second row) first stimulus, each shown undegraded and at four levels of degradation. Each histogram is divided into three intervals. From left: prestimulus interval of the current trial, first stimulus, delay interval. The horizontal line over each interval is the average rate in the interval. The average rate for each interval, with its standard error (in s⁻¹), is reported in the table below the histograms. The delay interval (ISI) has been divided into five parts of 1 s each; the mean rate over the entire ISI is in the last column on the right. For both the best and the worst first stimulus, one observes a variation of the rate during stimulation (the visual response) as a function of the level of degradation. For the first three windows in the "best" row, there is significant, invariant delay activity. At the last two levels of degradation, the delay activity drops and becomes indistinguishable from the delay rate for the "worst" stimuli. The rate in the first second of the ISI, following a strong visual response, is significantly higher than the mean rate of the delay interval. This is due to the latency of the response to the external stimulus. (See section 3.5.)
Figure 6: Testing information transmission across trials: spike rate histograms for the best and worst second stimulus and the corresponding table of mean rates (as in Figure 5). The histogram windows are divided into four intervals: second stimulus (underlined), post–second stimulus, central part of the intertrial interval (not shown), and the prestimulus interval of the successive trial. Note that the prestimulus activity of the trial following the best second stimulus is significantly higher than that following the worst second stimulus. It carries the information about the second stimulus of the previous trial (see the text and Figure 7).
and worst stimuli (plotted above and below the axis, respectively, though both are, of course, positive valued).3 The distributions differ significantly according to the T-test (P < 10⁻⁵). If this elevated rate were to continue beyond the monkey's behavioral response, it would arrive at the following trial as an elevated rate and appear
3 In calculating the average poststimulus rate, we take an interval starting 250 ms after the second stimulus is turned off, to exclude latency of response to the stimulus. (See section 3.5.)
Figure 7: Trial-by-trial rate distributions of the post–second stimulus rate (left, in s⁻¹) and the pre–first stimulus rate of the next trial (right) for the best (continuous line) and worst (dashed line, plotted downward for clarity) second stimulus. The two distributions differ significantly in the second case as well, carrying the information about the second stimulus of the preceding trial. The poststimulus interval is on average 710 ms long, starting 250 ms after the stimulus, to avoid latency effects.
as an elevated prestimulus activity before the next trial. There are previously reported indications that delay activity does in fact survive motor reaction events (Nakamura & Kubota, 1995), though here there is a novelty in that the postreaction activity is not motivated by the behavioral paradigm. As a test of this possibility, we separated the average prestimulus rates over the entire set of trials (N = 123 trials) into two sets, according to the preceding second stimulus (see Figure 6). It turns out that the average prestimulus rate in the group of trials following trials in which the second stimulus was the best stimulus (N = 62) is 8.6 s⁻¹, while for those following trials in which the second stimulus was the worst stimulus (N = 61), the average prestimulus rate is 5.7 s⁻¹. (This difference is significant according to the T-test, P < 10⁻⁵.) Yet another test of this effect will be discussed in section 3.6.

A simple tool suggests itself for improved online monitoring of the relevant prestimulus activity: to reduce (though not fully eliminate) the effect of a persistent post–second stimulus high rate, one should consider the average prestimulus activity over all trials, including all stimuli. This will help, provided the number of DADs in which the cell under observation participates is relatively low. That would be the typical case.4 (See also Miyashita & Chang, 1988.)

4 We are indebted to Nicolas Brunel for this observation.
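The split-and-test procedure above is easy to reproduce; the following sketch uses synthetic trial rates (the per-trial data are not published), with the group means set to the reported values and an arbitrary spread:

```python
import numpy as np
from scipy import stats

# Sketch of the comparison reported above, on synthetic (hypothetical) data:
# per-trial prestimulus rates split by the identity of the preceding second
# stimulus, compared with a two-sample t-test.
rng = np.random.default_rng(4)
pre_after_best = rng.normal(8.6, 3.0, 62)     # hypothetical rates, s^-1
pre_after_worst = rng.normal(5.7, 3.0, 61)
t_stat, p_val = stats.ttest_ind(pre_after_best, pre_after_worst)
print(f"t = {t_stat:.2f}, p = {p_val:.2g}")
```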
3.3 Attractor Dynamics. Attractor dynamics (associative memory) as a description of persistent DADs implies that each DAD is excited by a whole class of stimuli. Each of these stimuli raises the rates of a sufficient number of cells belonging to the DAD so that, via the learned collateral synaptic matrix, they can excite the entire subset of neurons characterizing the attractor and maintain them at elevated rates when the stimulus is removed. The class of stimuli leading to the same persistent distribution is the basin of attraction of that particular attractor. As one moves in the space of stimuli, at some point the boundary of the basin of attraction of the attractor is reached. Moving even slightly beyond this boundary, the network will relax either into another learned attractor or into the grand attractor of spontaneous activity. (See Amit & Brunel, 1997; and Amit, 1995.) To test the validity of the attractor picture, one has to be able to do the following:

1. Move in the space of stimuli by steps that are not too big, so that there is a choice between staying inside the basin and moving out. Recall that moving in this space does not mean only changing the stimuli, but changing them in a way that is IT observable.

2. Find IT-observable changes in stimuli that leave the delay activity unchanged.

3. Arrive at IT-observable changes in stimuli that will go over the edge (i.e., will not evoke the same DAD).

A convincing demonstration of these phenomena would require recording a large number of cells, especially since most cells with high rates in the same DAD (corresponding to the same stimulus) are predicted to lose their rates together. The prototypical behavior of such cells expresses itself in our single cell. First, Figure 5 demonstrates the required motion in stimulus space at the level of the IT cortex. In the five windows for the best stimulus, as well as in those for the worst, going from left to right, corresponding to increasing the degradation level as presented in Figure 4, there is a clear variation in the visual response: the neuron's spike rate during stimulation. This implies that the choice of the degradation mode is an IT-observable motion in stimulus space, as required in point 1 above. Second, in the top three windows in the best stimulus row, the delay activity is essentially invariant, as would have been expected if the stimulus were moving within the basin of attraction of that DAD. This corresponds to point 2 above. Third, as the level of degradation increases to reach the right-most two windows for the best stimulus, the delay activity disappears (see Figure 5). The average rate in the delay period becomes that of the reduced prestimulus activity discussed above, in accordance with the expectation in point 3 (see the table in Figure 5). This rate is also the same as that during the delay in all five windows corresponding to the worst stimulus.

The discussion of point 1 calls for a comment concerning the fact that as the best image is initially degraded, the visual response increases. Only at the fourth and fifth levels of degradation (3, 4) does the visual response decrease. This may seem contradictory, in that one might naively expect
Figure 8: Average delay activity rate (in s−1 , y-axis) versus average stimulus response (in s−1 , x-axis) for five levels of degradation of best stimulus. The degradation level is indicated above each data point (0 for pure image). Error bars are standard errors. Note that the stimulus response difference between 1 and 0, where delay activity is essentially constant, is larger than between 0 and 3, where delay activity collapses. Recall that 0 implies pure stimulus.
that the visual response should be maximal for the pure image and decrease monotonically with degradation. But on second thought, it should be clear that there is no reason that a particular cell should be maximally tuned to the image rather than to a degraded version of it. In fact, looking at other cells, we find no strong systematic trend for the visual response as a function of the degradation level (nor is there any argument why there should be one; Yakovlev, Fusi, Amit, Hochstein, & Zohary, 1995), although on average the visual response decreases with degradation. What is essential is that there be a set of neurons that have a strong visual response and a correspondingly significant elevated delay rate, such that when the images change moderately, the delay activity remains invariant, and when they change enough, the delay activity changes abruptly. Our paradigmatic cell would have belonged to such a class, had such a class existed.

This cell gives two more strong signals of the attractor property. Looking at the visual responses to the degraded best stimulus in the decreasing order of rates suggested above (2–1–0–3–4), one observes in the table in Figure 5 and in Figure 8 that the rate difference from 1 to 0 is 14.5 s−1 and there is no noticeable change in delay activity. Yet going from 0 to 3, the visual response changes by 13.1 s−1, and the delay activity collapses. In other words, the decrease in the delay activity seems precipitous as a function of the visual response, as would befit a watershed at the edge of the basin of attraction. The second signal is related to the fact that in the images used for testing the attractor property (see Figure 4), as the degradation noise is introduced, a circle appears around the figure (for technical reasons). This might have been perceived by the monkey as a different stimulus. That is, as the figure gets increasingly blurred, it is the circle that is most visible. Yet the delay
Table 1: Correlations of Delay Activity in Subintervals of ISI

First Stimulus Selection   Degradation Levels   Time Intervals   Number of Trials   MRC
B+W                        0-1-2                ISI1-ISI5        74                  0.399
B                          0-1-2-3-4            ISI1-ISI5        61                  0.382
B                          0-1-2                ISI1-ISI5        36                  0.080
B                          3-4                  ISI1-ISI5        25                 −0.123
W                          0-1-2                ISI1-ISI5        38                  0.089

Note: The intervals are the first 1 second following the removal of the stimulus and the last 1 second of the delay interval.
activity for the first three degradation levels (0–2) of the best stimulus remains unchanged with the appearance of the circle. The delay activity disappears when the degradation level crosses the critical level, despite the continued presence of the circle. Moreover, the circle produces no effect for the worst stimulus, despite the fact that it is visible there as well. This may be interpreted as due to the fact that the degradation circle appears only during testing and is not seen often enough to have been learned.

3.4 Delay Interval MRC. To test the attractor interpretation more closely, we calculated the MRC (see section 2.3) between the average rates in the 1 second immediately following the first stimulus and the 1 second immediately preceding the second stimulus. The results are summarized in Table 1. In a subset of trials with best and worst first stimuli and the first three degradation levels (0–2), we find MRC = 0.399. Note that in this subset of 74 trials, there are 36 trials with elevated delay activity and 38 with low delay activity. In other words, the high MRC may be attributed to the presence of two underlying subpopulations with different means. Similarly, in the subset of trials with best first stimuli and all degradation levels (second row in the table), there are 61 trials, of which 36 have elevated delay activity and 25 do not, and MRC = 0.382.

The 61 trials with best first stimulus are then divided into two subsets: one for the first three degradation levels (Nst = 36) and one for the last two (Nst = 25). In the first subset there is elevated delay activity; there is none in the second. For these two subsets we find MRC = 0.080 and −0.123, respectively. We conclude that the average delay rates in the first subset of trials remain high throughout the ISI and low in the second set. The deviations from the mean in each subset, at the beginning and the end of the ISI, are uncorrelated. This constitutes an example of the first option described at the end of section 2.3 and provides scales for high and low correlations.
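To make the statistical point concrete, the following minimal simulation (ours, not from the article; it assumes the MRC is the Pearson correlation of trial-by-trial mean rates, in line with its use here, and the rate values are invented) shows how pooling two subpopulations with different mean rates produces a high MRC even when the within-subset fluctuations are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Trial-by-trial mean rates (in s^-1) in two 1-second intervals of the ISI.
# Two subpopulations: trials with elevated delay activity and trials without.
n_hi, n_lo = 36, 38
hi_start = rng.normal(15.0, 3.0, n_hi)   # start of ISI, elevated trials
hi_end   = rng.normal(15.0, 3.0, n_hi)   # end of ISI, independent fluctuations
lo_start = rng.normal(6.0, 3.0, n_lo)    # start of ISI, low trials
lo_end   = rng.normal(6.0, 3.0, n_lo)

def mrc(a, b):
    """Mean rate correlation: Pearson correlation of per-trial mean rates."""
    return np.corrcoef(a, b)[0, 1]

# Within each unimodal subset the correlation is near zero ...
print("MRC, elevated subset:", mrc(hi_start, hi_end))
print("MRC, low subset:     ", mrc(lo_start, lo_end))

# ... but pooling the two subsets yields a high MRC, driven purely by the
# difference of the two means (compare rows 1-2 vs. rows 3-5 of Table 1).
pooled_start = np.concatenate([hi_start, lo_start])
pooled_end   = np.concatenate([hi_end, lo_end])
print("MRC, pooled trials:  ", mrc(pooled_start, pooled_end))
```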
Table 2: Single Neurons versus Attractor Dynamics: Correlations Between Visual Response and Several 1-Second Intervals of the ISI

First Stimulus Selection   Degradation Levels   Time Intervals   Number of Trials   MRC
B                          0-1-2                ST1-ISI1         36                  0.407
B+W                        0-1-2-3-4            ST1-ISI1         123                 0.784
B                          0-1-2                ST1-ISI4         36                  0.085
B                          0-1-2                ST1-ISI5         36                 −0.098

Note: First two rows have high MRCs due to latency of the visual response. No correlation in later intervals.
3.5 Single Neurons versus Attractor Dynamics. An alternative scenario to the attractor picture may attribute the enhanced delay activity to some change in the internal state of single cells. Such a scenario would imply that the change in the internal state of the cell is triggered by the level of the visual response. This would be required to make the persistent delay activity stimulus selective. It would further imply a correlation between the rate during the ISI and the rate during the visual response. Table 2 summarizes the MRCs between the visual response to the first stimulus and various 1-second intervals of the ISI. We find that the rate during the visual response to the first stimulus is correlated with the rate in the first 1-second interval of the ISI (MRC = 0.407), even in a subset of trials with stimuli all leading to elevated delay activity, that is, the 36 trials corresponding to the best stimulus at degradation levels 0–2. This high MRC in a trial set with a narrow distribution of rates indicates that the trial-by-trial fluctuations of the rate in the first 1 second of the ISI are correlated with the fluctuations in the rate during the visual response. In fact, if the effect of splitting the means is added, by computing the MRC of the visual response rates with the rates in the first 1 second of the ISI for all trials (Nst = 123), we find MRC = 0.784. We associate these high correlations with the latency of the visual response. In fact, it appears from Figure 5 that the visual response penetrates about 250 ms into the ISI.

Next we take the same set of 36 trials (best with degradation 0–2), which gave MRC = 0.407 between the rate in the visual response and the rate in the first 1 second of the ISI. The MRCs between the rate of the visual response and the rates in the fourth and fifth 1-second intervals of the ISI are 0.085 and −0.098, respectively (third and fourth rows in the table). We conclude that beyond the first second of the ISI, the fluctuations of the visual response are not correlated with the fluctuations of the delay activity. This makes the single-cell scenario rather implausible.
3.6 Information Transmission Across Trials. The discussion in section 3.2 of the meaning of prestimulus activity provides a partial confirmation of the scenario of information transport between first stimuli in consecutive trials, essential for the formation of the Miyashita correlations (Miyashita, 1988). That scenario requires that information about the coding of one stimulus be able to traverse the time interval between consecutive trials. The scenario proposed (Griniasty et al., 1993; Amit et al., 1994; Amit, 1995) is that DADs, when they first become stable, are uncorrelated. Once stable, they would be provoked as much by a (same) second stimulus in a trial as by the first one. Since the second stimulus in half of the trials is the same as the first (to prevent bias in the DMS paradigm), in half of the training trials the activity in the intertrial interval will be the same as the delay activity elicited by the first stimulus of the preceding trial. If training is done with a fixed sequence of first stimuli, then upon half of the presentations of a given first stimulus, the activity distribution in the IT module during the intertrial interval would be the same as the delay activity corresponding to the immediately preceding first stimulus. The joint presence of the delay activity of the previous stimulus and the activity stimulated by the current first stimulus is hypothesized to be the information source for the learning, into the synaptic structure, of the correlations between the delay activities corresponding to consecutive images.

We calculated the MRCs between the average rate in the 1 second immediately following the second stimulus and the average rate in an interval of 1 second immediately preceding the successive first stimulus (see Table 3). The idea is that since (for technical reasons) we do not have recordings during the entire interval separating two trials, if the activity between trials were undisturbed by the monkey's response, so as to arrive intact at the next first stimulus, then the MRC of the activities in the two extreme subintervals between trials should be similar to that between the two extreme subintervals of the undisturbed ISI. In fact, MRC = 0.320 for the intertrial interval computed over all the available trials (Nst = 123). The entire set of trials is composed of two well-balanced subsets with two different intertrial mean rates. This is so because the second stimulus is never degraded: for Nst = 62 of the trials the preceding second stimulus is the pure best image, leading to high intertrial persistent activity, and for Nst = 61 of the trials, the preceding second stimulus is the pure worst image, leading to low intertrial activity. This value is compared with the MRCs of extreme ISI intervals in trial subsets of similar statistics, with balanced subsets of low and high averages. Those correspond to the first and second rows in Table 1. In fact, the numbers are close. If the 123 trials are separated into two subsets, one with preceding best and one with preceding worst, we find MRC = 0.038 and MRC = 0.120, respectively (rows 2 and 3 in Table 3). These are naturally compared with the MRCs of subintervals of trials with best and worst first stimulus, with degradation levels (0–2) (third and fifth rows in Table 1), which are, respectively, 0.080 (Nst = 36) and 0.089 (Nst = 38).
Table 3: Information Transmission Across Trials: MRCs of 1 Second Post–Second Stimulus and 1 Second Pre–Next First Stimulus

Second Stimulus of Previous Trial   Time Intervals   Number of Trials   MRC
B+W                                 POST and PRE     123                0.320
B                                   POST and PRE     62                 0.038
W                                   POST and PRE     61                 0.120

Note: For a few trials, the time interval after the second stimulus is less than 1 second; for those trials, the mean rate is calculated over the available time interval. First row: the high MRC is due to the presence of two different means. Last two rows: absence of correlation in unimodal subsets of trials.
4 Discussion

The cell we have discussed presents a rich phenomenology, naturally interpreted in the learning-attractor paradigm. To establish the various characteristics of attractor dynamics and their internal structure, many more well-recorded cells are required. Our underlying intention has been to exploit this cell merely as an example of the kind of features that a cell participating in a presumed attractor scenario may have. Looking back at the wealth of data drawn from this cell, we feel we can say more. This collection of data can hardly be accounted for in other paradigms suggested to date. Most prominent among these data is the propagation of the persistent elevated rates following the test stimulus, after this matching stimulus is turned off, and even after the behavioral reaction is completed. The amount of data collected for this cell and the tools we have sharpened to analyze them leave no doubt that this propagation can indeed take place. In fact, we have found that this persistent activity is as robust when it crosses the intertrial interval (including the behavioral response) as when it crosses the ISI delay interval between sample and match. What is surprising is that following the second stimulus and the reaction, there is no apparent need for maintaining the active memory. One might have expected that the persistent activity distribution would die down at this point. It does not. This has been observed also by Nakamura and Kubota (1995), but the novelty here is that the delay activity continues to persist even when there seems to be no functional role for it. What seriously modifies the delay activity is a new visual stimulus, either the second stimulus following the ISI or the first in another trial, much as in Miller, Li, and Desimone (1993).

The fact that a DAD can get across the intertrial interval is an essential pillar in the construction of the dynamic learning scenario for the generation of the Miyashita (1988) correlations in the internal representations (working
memory) of stimuli often experienced in contiguity. Our single-cell result suggests that such correlations may find their origin in the arrival of selective persistent activity related to one stimulus at the presentation of the consecutive stimulus in the next trial.

The collective attractor picture would have been fuller had we had a case in which there is no visual response and yet elevated delay activity. In Sakai and Miyashita (1991) such cases are shown. Yet the mean rate correlation analysis on this cell shows that neural activity during the delay period is correlated with the visual response only in the interval immediately following the stimulus. This is strong evidence that what determines the actual spike rate later in the ISI lies beyond the single cell. It is most likely the result of the interaction of this cell with other cells that participate in the same DAD, via potentiated synapses.

Finally, there is an additional, behavioral correlate to the activity of this cell. When the performance level (percentage of correct responses) of the monkey is considered versus the level of stimulus degradation, averaged across experiments, it is found that the performance is similar for the first three levels of degradation. Then, for the fourth and fifth levels, performance drops abruptly to chance level, just as our cell indicates that the stimuli are no longer in the basin of attraction of the DAD (Yakovlev et al., 1995). This effect calls for much further study, and it opens new vistas toward the identification of a potential functional role of the delay activity. Sakai and Miyashita (1991) turned to the pair-associate paradigm because the monkeys perform the DMS task well even for stimuli that do not generate delay activity ("new stimuli"). Here we see that in the absence of delay activity, the DMS task was not performed successfully if the sample stimulus was very degraded. Our paradigmatic cell suggests that the mechanism that allows matching to an identical new sample may not work when the monkey has to match to a very degraded image. Yet as long as the DAD (the attractor) is effective, it corrects for the degradation by attracting to the prototype delay activity, and the task can be performed.

Acknowledgments

Without the experimental and intellectual contribution of Shaul Hochstein and Ehud Zohary, this article would not have been possible. Had we had our way, both would have figured among the authors. We have benefited from the contribution of Gil Rabinovici to the experiment. This cell was found during his stay in our lab as a summer project in his course of premedical studies at Stanford University. We are also indebted to J. Maunsell and R. Shapley for help in the early stages of this experiment and to Misha Tsodyks for detailed comments. This work was partly supported by grants from the Israel Ministry of Science and the Arts and the Israel National Institute of Psychobiology and by Human Capital and Mobility grant ERB-CHRX-CT93-0245 of the EEC.
References

Amit, D. J. (1994). Persistent delay activity in cortex: A Galilean phase in neurophysiology? Network 5:429.
Amit, D. J. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences 18:617.
Amit, D. J., & Brunel, N. (1995). Learning internal representations in an attractor neural network with analogue neurons. Network 6:39.
Amit, D. J., & Brunel, N. (1997). Global spontaneous activity and local structured (learned) delay activity in cortex. Cerebral Cortex 7(2):237.
Amit, D. J., Brunel, N., & Tsodyks, M. V. (1994). Correlations of cortical Hebbian reverberations: Experiment vs theory. J. Neurosci. 14:6435.
Brunel, N. (1994). Dynamics of an attractor neural network converting temporal into spatial correlations. Network 5:449.
Brunel, N. (1996). Hebbian learning of context in recurrent neural networks. Neural Computation 8:1677.
Fuster, J. M. (1995). Memory in the cerebral cortex. Cambridge, MA: MIT Press.
Griniasty, M., Tsodyks, M. V., & Amit, D. J. (1993). Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Computation 5:1.
Kovacs, G., Vogels, R., & Orban, G. A. (1995). Selectivity of macaque inferior temporal neurons for partially occluded shapes. J. Neurosci. 15:1984.
Miller, E. K., Li, L., & Desimone, R. (1993). Activity of neurons in anterior inferior temporal cortex during a short-term memory task. J. Neurosci. 13:1460.
Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature 335:817.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature 331:68.
Miyashita, Y., Higuchi, S., Sakai, K., & Masui, N. (1991). Generation of fractal patterns for probing visual memory. Neurosci. Res. 12:307.
Nakamura, K., & Kubota, K. (1995). Mnemonic firing of neurons in the monkey temporal pole during a visual recognition memory task. J. Neurophysiol. 74:162.
Sakai, K., & Miyashita, Y. (1991). Neural organisation for the long-term memory of paired associates. Nature 354:152.
Williams, G. V., & Goldman-Rakic, P. S. (1995). Modulation of memory fields by dopamine D1 receptors in prefrontal cortex. Nature 376:572.
Wilson, F. A. W., Scalaidhe, S. P. O., & Goldman-Rakic, P. S. (1993). Dissociation of object and spatial processing domains in primate prefrontal cortex. Science 260:1955.
Yakovlev, V., Fusi, S., Amit, D. J., Hochstein, S., & Zohary, E. (1995). An experimental test of attractor network behavior in IT of the performing monkey. Israel J. of Medical Sciences 31:765.
Received June 7, 1996; accepted September 11, 1996.
Communicated by Todd Leen and Christopher Bishop
Noise Injection: Theoretical Prospects

Yves Grandvalet
Stéphane Canu
CNRS UMR 6599 Heudiasyc, Université de Technologie de Compiègne, Compiègne, France

Stéphane Boucheron
CNRS-Université Paris-Sud, 91405 Orsay, France
Noise injection consists of adding noise to the inputs during neural network training. Experimental results suggest that it might improve the generalization ability of the resulting neural network. A justification of this improvement remains elusive: describing analytically the average perturbed cost function is difficult, and controlling the fluctuations of the random perturbed cost function is hard. Hence, recent papers suggest replacing the random perturbed cost by a (deterministic) Taylor approximation of the average perturbed cost function. This article takes a different stance: when the injected noise is gaussian, noise injection is naturally connected to the action of the heat kernel. This provides indications on the relevance domain of traditional Taylor expansions and shows the dependence of the quality of Taylor approximations on global smoothness properties of the neural networks under consideration. The connection between noise injection and the heat kernel also enables controlling the fluctuations of the random perturbed cost function. Under the global smoothness assumption, tools from gaussian analysis provide bounds on the tail behavior of the perturbed cost. This finally suggests that mixing input perturbation with smoothness-based penalization might be profitable.

1 Introduction

Neural network training consists of minimizing a cost functional C(·) on the set of functions F realizable by multilayer Perceptrons (MLP) with fixed architecture. The cost C is usually the averaged squared error,

C(f) = E_Z ( f(X) − Y )^2 ,   (1.1)
where the random variable Z = (X, Y) describing the data is sampled according to a fixed but unknown law. Because C is not computable, an
empirically computable cost is then minimized in applications using a sample z^ℓ = {z_i}_{i=1}^{ℓ}, with z_i = (x_i, y_i) ∈ R^d × R, gathered by drawing independent identically distributed data according to the law of Z. An estimate f̂_emp of the regression function f*(x) = arg min_{f∈L_2} C(f) is given by the minimization of the empirical cost C_emp:

C_emp(f) = (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i) − y_i )^2 .   (1.2)
The cost C_emp(·) is a random functional with expectation C(·). In order to justify the minimization of C_emp, the convergence of the empirical cost toward its expectation should be uniform with respect to f ∈ F (Vapnik, 1982; Haussler, 1992). When F is too large, this may not hold against some sampling laws. Hence practitioners have to trade the expressive power of F against the ability to control the fluctuations of C_emp(·). This suggests analyzing modified estimators that possibly restrict the effective search space.

One of these modified training methods consists of applying perturbations to the inputs during training. Experimental results in Sietsma and Dow (1991) show that noise injection (NI) can dramatically improve the generalization ability of MLP. This is especially attractive because the modified minimization problem can be solved thanks to the initial training algorithm. During NI, the original training sample z^ℓ is distorted by adding some noise η to the inputs x_i while leaving the target values y_i unchanged. During the kth epoch of the backpropagation algorithm, a new distortion η_k is applied to z^ℓ. The distorted sample is then used to compute the error and to derive the weight updates. A stochastic algorithm is thus obtained that eventually minimizes C^m_NIemp, defined as:

C^m_NIemp(f) = (1/m) \sum_{k=1}^{m} (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i + η_{k,i}) − y_i )^2 ,   (1.3)
where the number of replications m is set under user control but finite. The average value of the perturbed cost is:

C_NI(f) = E_η [ (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i + η_i) − y_i )^2 ].   (1.4)
In this article, the noise η is assumed to be a centered gaussian vector with independent coordinates: E[η] = 0 and E[η η^T] = σ^2 I. The success of NI is intuitively explained by asserting that minimizing C_NI (see equation 1.4) ensures that similar inputs lead to similar outputs. It raises two questions: When should we prefer to minimize C_NI rather than C_emp? How does C^m_NIemp converge toward C_NI?
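To fix ideas, here is a minimal sketch of the NI procedure itself (our illustration, not the authors' code; the architecture, learning rate, and noise level are hypothetical): at each epoch a fresh gaussian distortion is drawn for every input, the targets are untouched, and ordinary backpropagation is run on the distorted sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sample: ell points in R^d with scalar targets.
ell, d, hidden = 20, 2, 8
x = rng.normal(size=(ell, d))
y = (x[:, 0] < 0).astype(float)

# One-hidden-layer MLP with sigmoidal units.
W1 = rng.normal(scale=0.5, size=(d, hidden))
W2 = rng.normal(scale=0.5, size=(hidden, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(xb):
    h = sigmoid(xb @ W1)
    return h, sigmoid(h @ W2).ravel()

sigma2, lr, epochs = 2.5e-3, 0.5, 200
for epoch in range(epochs):
    # Fresh gaussian distortion of the inputs at each epoch (targets unchanged).
    eta = rng.normal(scale=np.sqrt(sigma2), size=x.shape)
    xb = x + eta
    h, out = forward(xb)
    err = out - y  # drives the squared-error gradient (constant factor absorbed in lr)
    # Backpropagation through the two sigmoid layers.
    g_out = (err * out * (1 - out))[:, None]
    gW2 = h.T @ g_out / ell
    g_h = (g_out @ W2.T) * h * (1 - h)
    gW1 = xb.T @ g_h / ell
    W2 -= lr * gW2
    W1 -= lr * gW1

print("final empirical cost:", np.mean((forward(x)[1] - y) ** 2))
```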
Recently, several authors (Webb, 1994; Bishop, 1995; Leen, 1995; Reed, Marks, & Oh, 1995; An, 1995) have resorted to Taylor expansions to describe the impact of NI and to motivate the minimization of C_NI rather than C_emp. They not only try to provide a formal description of NI but also aim at finding a deterministic alternative to the minimization of C^m_NIemp.

This article takes a different approach: when the injected noise is gaussian, the Taylor expansion approach is connected to the action of the heat kernel, and the dependence of C_NI (see equation 1.4) on the noise variance is shown to obey the heat equation (see section 2.1). This clear connection between partial differential equations and NI provides some indications on the relevance domain of traditional Taylor expansions (see section 2.2). Finally, we analyze the simplified expressions that are assumed to be valid locally around optimal solutions (see section 2.3).

The connection between NI and the action of the heat kernel also enables control of the fluctuations of the random perturbed cost function. Under some natural global smoothness property of the class of MLPs under consideration, tools from gaussian analysis provide exponential bounds on the probability of deviation of the perturbed cost (see section 3.3). This suggests that mixing NI with smoothness-based penalization might be profitable (see section 3.4).

2 Taylor Expansions

2.1 Gaussian Perturbation and Heat Equation. To exhibit the connection between gaussian NI and the heat equation, let us define u : R_+ × R^{ℓ×d} → R by:

u(0, x) = (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i) − y_i )^2   (2.1)

u(t, x) = E_η [ (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i + η_i) − y_i )^2 ].   (2.2)
Obviously, we have u(0, x) = C_emp(f) and u(t, x) = C_NI(f) when t is the noise variance σ^2. Each value of the noise variance defines a linear operator T_t that maps u(0, ·) onto u(t, ·). Moreover, since the sum of two independent gaussians with variances s and t is gaussian with variance s + t, the family (T_t)_{t≥0} defines a semigroup, the heat semigroup (cf. Ethier & Kurtz, 1986, for an introduction to semigroup operators). The function u obeys the heat equation (cf. Karatzas & Shreve, 1988, chap. 4, secs. 3, 4):

∂u/∂t = (1/2) Δ_xx u   (2.3)
where Δ_xx is the Laplacian with respect to x, and where the initial conditions are defined by equation 2.1. For the sake of self-containment, a derivation of equation 2.3 is given in the appendix when initial conditions are square integrable. Let us denote by C_NI(f, t) the perturbed cost when the noise is gaussian of variance t; equation 2.3 yields:

C_NI(f, t) = C_emp(f) + (1/2) \int_0^t Δ_xx C_NI(f, s) ds.   (2.4)
Therefore, C_NI can be investigated in the purely analytical framework of partial differential equations (under the gaussian assumption). The possibility of forgetting about the original probabilistic setting when dealing with neural networks follows from the Tikhonov uniqueness theorem (Karatzas & Shreve, 1988, chap. 4, sec. 4.3).

Observation 2.1. If F is a class of functions definable by some feedforward architecture using sigmoidal, radial basis functions, or piecewise polynomials as activation functions, and if the injected noise follows a gaussian law, then the perturbed cost C_NI is the unique function of the variance t that obeys the heat equation 2.3 with initial conditions defined in equation 2.1.

Any deterministic faithful simulation of NI should use some numerical analysis software to integrate the heat equation and then run some backpropagation software on the result. We do not recommend such a methodology for efficiency reasons, but we insist that stochastic representations of solutions of partial differential equations (PDEs) have proved useful in analysis (Karatzas & Shreve, 1988). Methods reported in the literature (such as finite differences and finite elements; cf. Press, Teukolsky, Vetterling, & Flannery, 1992) assume that the function defining the initial conditions has bounded support. This assumption, which makes sense in physics, is not valid in the neural network setting. Hence Monte-Carlo methods appear to be the ideal technique to solve the PDE problem raised by NI.

2.2 Taylor Expansion Validity Domain. Let C_Taylor(f) be the first-order Taylor expansion of C_NI(f) as a function of t = σ^2. For various kinds of noise and function classes F, it has been shown in Matsuoka (1992), Webb (1994), Grandvalet and Canu (1995), Bishop (1995), and Reed et al. (1995) that:

C_Taylor(f) = C_emp(f) + (σ^2/2) Δ_xx C_emp(f).   (2.5)
In the context of gaussian NI, this means that the Laplacian is the infinitesimal generator of the heat semigroup. To emphasize the distinction
between equations 2.5 and 2.3, one should stress that the heat equation is a correct description of the impact of gaussian NI not only in the small variance limit, but for any value of the variance. The Taylor approximation validity domain is restricted to those functions f such that

lim_{σ^2 → 0} [ C_NI(f) − C_Taylor(f) ] / σ^2 = 0.   (2.6)
Observation 2.2. A sufficient condition for the Taylor approximation to be valid is that C_emp belongs to the domain of the generator of the heat semigroup. The empirical cost C_emp has to be a licit initial condition for the heat equation, which is always true in the neural network context (cf. conditions in Karatzas & Shreve, 1988, theorems 3.3, 4.2, chap. 4).

The preceding statement is purely analytical and does not say much about the relevance of minimizing C_Taylor while training neural networks. This issue may be analyzed in several directions: Is the minimization of C_Taylor equivalent to the minimization of C_NI? Is the minimization of C_Taylor interesting in its own right? The second issue and related developments are addressed in Bishop (1995) and Leen (1995). The first issue cannot be settled in a positive way for arbitrary variances in general, but the principle of minimizing C_NI (and C_Taylor) should be definitively ruled out if the minima of C_NI (and C_Taylor) did not converge toward the minima of C_emp when t = σ^2 → 0. This cannot be deduced directly from equation 2.6 since it describes only simple convergence. A uniform convergence over F is required. As we will vary t, let us denote by C_NI(f, t) (respectively, C_Taylor(f, t)) the perturbed cost (resp. its Taylor approximation) of f when the noise variance is t; we get:
C_NI(f, t) − C_Taylor(f, t) = (1/2) \int_0^t [ Δ_xx C_NI(f, s) − Δ_xx C_emp(f) ] ds.   (2.7)
If some uniform (over F and s ≤ t_0 < ∞) bound on |Δ_xx C_NI(f, s) − Δ_xx C_emp(f)| is available, then lim_{t→0} max_{f∈F} C_NI(f, t) − C_Taylor(f, t) = 0, and small manipulations using the triangle inequality reveal the convergence of the minima. The same argument shows the convergence of the minima of C_NI toward the minima of C_emp. If upper bounds are imposed on the weights of a sigmoidal neural network, those global bounds are automatically enforced. Imposing bounds on weights is an important requirement to ensure the validity of the Taylor approximation. We intuitively expect the truncation of the Taylor series to be valid in the small variance limit. But if f(x) = g(wx), where g is a parameterized function and w is a free parameter, then
Figure 1: Costs Cemp (top left), CTaylor (top right), and CNI (bottom left) in the space (a, b). These costs are compared on the bottom-right figures for two values of the parameter a: Cemp is solid, CTaylor is dotted, and CNI is dashed.
f(x + η) = g(wx + wη). The noise η injected in f appears to g as a noise wη, that is, as a noise of variance w^2 σ^2. Therefore, if g is nonlinear (i.e., f nonlinear), the Taylor expansion will fail for large w^2. A simple illustration is given in Figure 1. The sample contains 10 (x, y) pairs, where the x_i are regularly placed on [−0.5, 0.5], and y = 1_{x<0}: z^ℓ = {(−0.5, 1), (−0.4, 1), . . . , (−0.1, 1), (0.1, 0), . . . , (0.5, 0)}. The class F is defined by f(x) = exp(ax + b) / (1 + exp(ax + b)), a, b ∈ R, and the quadratic loss is used. The three error surfaces for C_emp, C_Taylor, and C_NI are represented for σ^2 = 2.5 × 10^{−3}. Although the noise variance is quite small, there are some noticeable differences: the cost C_NI is smoother than C_emp (effect of convolution), which is in turn smoother than C_Taylor. The dissimilarities of these surfaces can be shown for any nonzero value of σ^2 by a proper choice of the scale. The bottom-right drawings show the error as a function of b for two values of the scale parameter a. They show how the Taylor expansion becomes more and more inaccurate as a^2 increases. Note that the number of oscillations of C_Taylor is equal to the number of points in the sample.
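The effect can be reproduced numerically; the following sketch (ours, not from the article; the Laplacian of C_emp is taken by central finite differences and C_NI is estimated by Monte Carlo, so all values are illustrative) compares the three costs on the sample above for a moderate and a large scale parameter a:

```python
import numpy as np

rng = np.random.default_rng(0)

# The sample of Figure 1: 10 points on [-0.5, 0.5], targets y = 1_{x<0}.
x = np.array([-0.5, -0.4, -0.3, -0.2, -0.1, 0.1, 0.2, 0.3, 0.4, 0.5])
y = (x < 0).astype(float)
sigma2 = 2.5e-3

def f(a, b, xs):
    return 1.0 / (1.0 + np.exp(-(a * xs + b)))

def c_emp(a, b, xs=x):
    return np.mean((f(a, b, xs) - y) ** 2)

def c_taylor(a, b, h=1e-4):
    # C_emp + (sigma^2 / 2) * Laplacian_x C_emp (equation 2.5), with the
    # Laplacian estimated by central finite differences, one input at a time.
    lap = sum((c_emp(a, b, x + h * e) - 2 * c_emp(a, b)
               + c_emp(a, b, x - h * e)) / h ** 2
              for e in np.eye(len(x)))
    return c_emp(a, b) + 0.5 * sigma2 * lap

def c_ni(a, b, n_mc=20000):
    # Monte Carlo estimate of the average perturbed cost (equation 1.4).
    eta = rng.normal(scale=np.sqrt(sigma2), size=(n_mc, len(x)))
    return np.mean((f(a, b, x + eta) - y) ** 2)

for a in (-10.0, -100.0):   # moderate vs. large scale parameter
    b = 0.0
    print(f"a={a:6.1f}: C_emp={c_emp(a, b):.4f}  "
          f"C_NI={c_ni(a, b):.4f}  C_Taylor={c_taylor(a, b):.4f}")
```

For the moderate scale the three costs nearly coincide, while for the large scale the Taylor correction becomes wildly inaccurate, in line with the w^2 σ^2 argument above.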
Remark. Although equation 2.5 has been established for various kinds of noise (strong integrability properties are sufficient), C_NI is the solution of the heat equation only for gaussian noise. In nongaussian contexts, NI usually does not define a semigroup of operators indexed by variance.

2.3 Local Analysis and Simplified Expressions. Requiring uniform bounds on |Δ_xx C_NI(f, s) − Δ_xx C_emp(f)| may be too demanding. Fortunately, it is possible to adapt the local approach to NI suggested in Leen (1995) and Bishop (1995) to provide another derivation of the behavior of the minima of C_NI and C_Taylor. The local approach is concerned with the behavior of C_NI, C_Taylor, and possibly other cost functions such as the derivative-based regularizer (Leen, 1995; Bishop, 1995) around minima (global or local) of C_emp. Although Bishop and Leen focused their attention on the case where E(Y|X = x) is realizable by the neural architecture and the empirical measure closely mimics the sampling law (e.g., Leen, 1995, sec. 3.1.1), we believe that their local approach can be extended and applied to the comparison of the critical points of C_emp, C_NI, and C_Taylor.

A critical point of C_emp (and similarly for other costs) is a weight assignment where the gradient of C_emp vanishes. A critical point is nondegenerate iff the Hessian matrix has full rank. In the sequel, we will assume that C_emp has only nondegenerate critical points. This is not a major restriction since the set of target values y_i for which C_emp does have degenerate critical points has measure 0 for the kind of neural architectures mentioned here. Moreover, the number of nondegenerate critical points is finite (Sontag, 1996).

Observation 2.3. If C_emp has no degenerate critical points, then there exists some t_0 > 0 such that for any t, 0 ≤ t ≤ t_0, to any critical point of C_emp there corresponds a critical point of C_Taylor and a critical point of C_NI; moreover, those critical points are within distance Kt of each other for some constant K that depends on the sample under consideration.

Proof. The argument extends Leen's (1995) suggestion. In the course of the argument, we will assume that C_NI and C_Taylor have partial derivatives with respect to t at t = 0; this can be enforced by taking a linear continuation for t < 0. Let us assume that F is parameterized by W weights. Let ∇_w C_NI(f, t) and ∇_w C_Taylor(f, t) denote the gradients of C_NI and C_Taylor with respect to the weight assignment for some value of f and σ^2 = t (note that at t = 0 the two values coincide with ∇_w C_emp(f)). If f^• is some nondegenerate critical point of C_emp, then the matrix of partial derivatives of ∇_w C_NI(f, t) and ∇_w C_Taylor(f, t) with respect to weights and time has full rank; thus, by the implicit function theorem in its surjective form (Hirsch, 1976, p. 214), there exists a neighborhood of (f^•, 0),
and diffeomorphisms φ and ψ defined in a neighborhood of (0, 0) ∈ R^{W+1}, such that φ(0, 0) = (f^•, 0) (resp. ψ(0, 0) = (f^•, 0)) and ∇_w C_NI(φ(u, v)) = u (resp. ∇_w C_Taylor(ψ(u, v)) = u). The gradients of φ and ψ at (0, 0) have L_2 norm less than or equal to the norm of the matrix of partial derivatives of ∇_w C_emp(f^•, 0). For sufficiently small values of t, φ(0, t) and ψ(0, t) define continuous curves of critical points of C_NI(·, t) and C_Taylor(·, t). As the number of critical points of C_emp is finite, we may assume that those curves do not intersect and that the norms of the gradients of φ(0, t) and ψ(0, t) with respect to t are upper bounded. The observation follows.

Remark 1. For sufficiently small t, the local minima of C_emp can be injected into the set of local minima of C_Taylor and C_NI. For C_Taylor, the reverse is true. The definition of C_Taylor obeys the same constraints as the definition of C_emp: it is defined using solely +, ×, constants, and exponentiation; hence, by Sontag's bound, for any t, except on a set of measure 0 of target values y, the number of critical points of C_Taylor is finite, and the argument that was used in the proof works in the other direction. For sufficiently small t, C_Taylor does not introduce new local minima. This argument cannot be adapted to C_NI, whose definition also requires an integration. Nevertheless, experimental results suggest that NI tends to suppress spurious local minima (Grandvalet, 1995).

Remark 2. The validity of the observation relies on the choice of activation functions. If F were constituted by the class of functions parameterized by α ≥ 0 mapping x ↦ sin(αx), one could manufacture a sample such that C_emp and C_Taylor both have countably many nondegenerate minima, which are in one-to-one correspondence, and such that the convergence of the minima of C_NI toward the minima of C_emp as t tends toward 0 is not uniform.
3 Noise Injection and Generalization

The alleged improvement in generalization provided by NI remains to be analytically confirmed and explained. Most attempts to provide an explanation resort to the concepts of penalization and regularization. Usually penalization consists of adding a positive functional Ω(·) to the empirical risk. It is called a regularization if the sets {f : f ∈ F, Ω(f) ≤ α} are compact for the topology on F. Penalization and regularization are standard ways of improving regression and estimation techniques (e.g., Grenander, 1981; Vapnik, 1982; Barron, Birgé, & Massart, 1995). When F is the set of linear functions, NI has been recognized as a regularizer in the Tikhonov sense (Tikhonov & Arsenin, 1977). This is the only reported case, but it is enough to consider F as constituted by univariate degree-2 polynomials to realize that NI cannot generally be regarded as a
penalization procedure (Barron et al., 1995): C_NI − C_emp is not always positive. Thus, the improvement of generalization ability attributed to NI still requires some explanations.

3.1 Noise Injection and Kernel Density Estimation. An appealing interpretation connects NI with kernel estimation techniques (Comon, 1992; Holmström & Koistinen, 1992; Webb, 1994). Minimization of the empirical risk might be a poor or inefficient heuristic because the minima (if there are any) of C_emp could be far away from those of C. Recall that when trying to perform regression with sigmoidal neural networks, we have no guarantees that C_NI has a single global minimum, or even that the infimum of C_NI is realized (Auer, Herbster, & Warmuth, 1996). Hence the safest (and, to our knowledge, only) way to warrant the consistency of the NI technique is to get a global control on the fluctuations of C_NI(·) with respect to C(·), that is, on sup_F |C_NI(f) − C(f)|. The poor performance of minimization of the empirical risk could be due to the slow convergence of the empirical measure p̂_Z = (1/ℓ) \sum_i δ_{x_i, y_i} toward the sampling probability in the pseudometric induced by F (d_F(p̂ − p̂′) = sup_{f∈F} |E_p̂ (f(X) − Y)^2 − E_{p̂′} (f(X) − Y)^2|). But minimizing C_NI is equivalent (up to an irrelevant constant factor) to minimizing

(1/ℓ) \sum_{i=1}^{ℓ} E_{η,η′} ( f(x_i + η_i) − (y_i + η′) )^2 ,   (3.1)
where η′ is a scalar gaussian independent of η. Minimizing C_NI consists of minimizing the empirical risk against a smoothed version of the empirical measure: the Parzen-Rosenblatt estimator of the density (for details on the latter, see Devroye, 1987). The connection with gaussian kernel density estimation and regularization can thus be established in the perspective described in Grenander (1981). Because the gaussian kernel is the fundamental solution of the heat equation, the smoothed density that defines equation 3.1 is obtained by running the heat equation using the empirical measure as an initial condition (this is called the Weierstrass transform in Grenander, 1981). It is then tempting to explain the improvement in generalization provided by NI using upper bounds on the rate of convergence of the Parzen-Rosenblatt estimator. Though conceptually appealing, this approach might be disappointing; it actually suggests solving a density estimation problem as a subproblem of a regression problem. The former is much harder than the latter (Vapnik, 1982; Devroye, 1987).

3.2 Consistency of NI. To assess the consistency of minimizing C_NI, we would like to control C_NI(·) − C(·). A reasonable way of analyzing the fluctuations of C_NI(·) − C(·) consists of splitting it into two summands and
bounding each summand separately:

|C(f) − C_NI(f)| ≤ |C(f) − E_{p_Z} C_NI(f)| + |C_NI(f) − E_{p_Z} C_NI(f)|.   (3.2)
The first summand bounds the bias induced by taking the smoothed version of the squared error. It is not a stochastic quantity; it should be analyzed using tools from approximation theory. This first summand is likely to grow with t. On the contrary, the second term captures the random deviation of C_NI with respect to its expectation. Taking advantage of the fact that C_NI depends smoothly on the sample, it may be analyzed using tools from empirical process theory (cf. Ledoux & Talagrand, 1991, chap. 14). It is expected that this second summand decreases with t.

3.3 Bounding Deviations of Perturbed Cost. As practitioners are likely to minimize C^m_NIemp rather than C_NI, another difficulty has to be faced: controlling the fluctuations of C^m_NIemp with respect to C_NI. For a fixed sample, C^m_NIemp is a sum of independent random functions with expectation C_NI; in principle, it could also be analyzed using empirical process techniques. In the case of sigmoidal neural networks, the boundedness of the summed random variables ensures that for each individual f, C^m_NIemp(f) converges almost surely toward C_NI(f) as m → ∞ and that \sqrt{m} ( C^m_NIemp(f) − C_NI(f) ) converges in distribution toward a gaussian random variable. But using empirical processes would not pay tribute to the fact that the sampling process defined by NI is under user control, and in our case gaussian. Sufficiently smooth functions of gaussian vectors actually obey nice concentration properties, as illustrated in the following theorem:

Theorem 3.1 (Tsirelson, 1976; cf. Ledoux & Talagrand, 1991). Let X be a standard gaussian vector on R^d. Let f be a Lipschitz function on R^d with Lipschitz constant smaller than L; then:

P{ |f(X) − E f(X)| > r } ≤ 2 exp( −r^2 / (2L^2) ).

3.3.1 Pointwise Control of the Fluctuations of C^m_NIemp. If F is constituted by a class of sigmoidal neural networks, the differentiability assumption for the square loss as a function of inputs is automatically fulfilled.
Assumption 1. In the sequel, we will assume that all weights defining functions in F are bounded so that F is uniformly bounded by some constant M_0 and the gradient of f ∈ F is smaller than some constant L. Let M be a constant greater than M_0 + max_i y_i.

Then if (1/(ℓm)) \sum_{k=1}^{m} \sum_{i=1}^{ℓ} ( f(x_i + η_{k,i}) − y_i )^2 is regarded as a function on R^{ℓmd} provided with the Euclidean norm, its Lipschitz constant is upper bounded by 2LM / \sqrt{ℓm}.
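The stated Lipschitz constant follows from a short computation, sketched here for completeness (our derivation; the article asserts the bound without detail):

```latex
% Our sketch of the Lipschitz bound stated above.
% Write $g(\eta) = \frac{1}{\ell m}\sum_{k=1}^{m}\sum_{i=1}^{\ell}
%   \bigl(f(x_i+\eta_{k,i}) - y_i\bigr)^2$.
% The block of $\nabla g$ along the coordinates $\eta_{k,i}\in\mathbf{R}^d$ is
%   $\frac{2}{\ell m}\,\bigl(f(x_i+\eta_{k,i}) - y_i\bigr)\,
%    \nabla f(x_i+\eta_{k,i})$,
% of norm at most $2ML/(\ell m)$, since $|f-y|\le M$ and $\|\nabla f\|\le L$.
% Summing the squared norms of the $\ell m$ blocks gives
\[
  \|\nabla g(\eta)\|
  \;\le\; \sqrt{\,\ell m\,\Bigl(\frac{2ML}{\ell m}\Bigr)^{2}}
  \;=\; \frac{2LM}{\sqrt{\ell m}} .
\]
```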
Applying the preceding theorem to C^m_NIemp(·) implies the following observation:

Observation 3.1. If F satisfies assumption 1, then:

P_η{ |C^m_NIemp(f) − C_NI(f)| > r } ≤ 2 exp( −mℓr^2 / (8tL^2 M^2) ).
Remark. The dependence of the upper bound on the Lipschitz constant of C_NI with respect to x cannot be improved since theorem 3.1 is tight for linear functions, but the combined dependence on M and L has to be assessed for sigmoidal neural networks. For those networks, large inputs tend to generate small gradients; hence the upper bound provided here may not be tight.

3.3.2 Global Control on C^m_NIemp. Pointwise control of C^m_NIemp − C_NI is insufficient since we need to compare the minimization of C^m_NIemp with the minimization of C_NI. We may first notice that inf_{f∈F} C^m_NIemp(f) is a biased estimator of inf_{f∈F} C_NI(f):

E_η inf_{f∈F} C^m_NIemp(f) ≤ inf_{f∈F} C_NI(f).   (3.3)
Second, we may notice that inf_{f∈F} C^m_NIemp(f) is a concave function of the empirical measure defined by η; hence, it is a backward supermartingale¹ and thus converges almost surely toward a random variable as m → ∞. If C^m_NIemp is to converge toward C_NI, inf_{f∈F} C^m_NIemp(f) is due to converge toward inf C_NI. To go beyond this qualitative picture, we need to get a global control on the fluctuations of sup_{f∈F} |C^m_NIemp(f) − C_NI(f)|. Let g denote the function:

η ↦ sup_{f∈F} | (1/(mℓ)) \sum_{i,k} ( f(x_i + η_{k,i}) − y_i )^2 − C_NI(f) | .

If F satisfies assumption 1, then g is also Lipschitz with a coefficient less than 2LM / \sqrt{mℓ}.

Let us denote E_η sup_{f∈F} |C^m_NIemp(f) − C_NI(f)| by E. E may be finite or infinite. E is a function of t, ℓ, m.

¹ Up to some measurability conditions that are enforced for fixed-architecture neural networks.
Assumption 2. E is finite.

Theorem 3.1 also applies to g, and we get the following concentration result:

Observation 3.2. If F is a class of bounded Lipschitz functions satisfying assumptions 1 and 2, then

P_η{ sup_{f∈F} |C^m_NIemp(f) − C_NI(f)| > E + r } ≤ 2 exp( −mℓr^2 / (8tL^2 M^2) ).   (3.4)
This result is only partial in the absence of a good upper bound on E. It is possible to provide explicit upper bounds on E using metric entropy techniques and using the subgaussian behavior of the C^m_NIemp process described by observation 3.1. Using observation 3.2, we may partially control the deviations of approximate minima of C^m_NIemp with respect to the infimum of C_NI:

Observation 3.3. If F satisfies assumptions 1 and 2, and if, for any sample, f^• satisfies C^m_NIemp(f^•) < inf_{f∈F} C_NI(f) + ε, then

E C_NI(f^•) < inf_{f∈F} C_NI(f) + 2E + ε,

and

P_η{ C_NI(f^•) ≥ inf_{f∈F} C_NI(f) + ε + 2(E + r) } ≤ 2 exp( −mℓr^2 / (8tL^2 M^2) ).

If ε and E can be taken arbitrarily close to 0 using proper tuning of m and t, the corollary shows that minimizing C^m_NIemp is a consistent way of approximating the minimum of C_NI.

3.4 Combining Global Smoothness Prior and Input Perturbation. The derivative-based regularizers proposed in Bishop (1995) and Leen (1995) are based on sums of terms evaluated at a finite set of data points. It has been argued that they are not global smoothing priors. At first sight, it may seem that NI escapes this difficulty; the previous analysis, and particularly observation 3.1, as rough as it may be, shows that it may be quite hard to control NI in the absence of any smoothness assumption. Thus it seems cautious to supplement NI or data-dependent derivative-based regularizers with global smoothness constraints.

For sigmoidal neural networks, weight decay can be combined with NI. The sum M of the absolute values of the weights provides an upper bound on the output values. Let L be the product over the different layers of the
sum of the absolute values of the weights in those layers; L is an upper bound on the Lipschitz constant of the network. Penalizing by L + M would restrict the search space so that C_NI is well behaved. Such combinations of penalizers have been reported in the literature (Plaut, Nowlan, & Hinton, 1986; Sietsma & Dow, 1991) to be successful. However, it seems quite difficult to compare analytically the respective advantages of penalization alone and penalization supplemented with NI. This would require establishing lower rates of convergence for one of the techniques. Such rates are not available to our knowledge.

4 Conclusion and Perspective

Under a gaussian distribution assumption, NI can be cast as a stochastic alternative to a regularization of the empirical cost by a partial differential equation. This is more likely to facilitate a rigorous analysis of the NI heuristic than to be a source of efficient algorithms. It allows the assignment of a precise meaning to the concept of validity of Taylor approximations of the perturbed cost function. For sigmoidal neural networks, and more generally for classes of functions that can be defined using exponentials and polynomials, those Taylor approximations are valid in the sense that minimizing C_Taylor and C_NI turns out to be equivalent in the infinitesimal variance limit. The practical relevance of this equivalence remains questionable since practical uses of NI require finite noise variance. The relationship between NI and the heat equation enables establishing a simple bridge between regularization and the transformation of C_emp into C_NI: the contractivity of the heat semigroup ensures that the regularized version of the effective cost is easier to estimate than the original effective cost. Finally, because practical applications are more likely to minimize empirical versions C^m_NIemp of C_NI than C_NI itself, we resort to results from the theory of stochastic calculus to provide bounds on the tail probability of deviations of C^m_NIemp.

Despite the clarification provided by the heat connection, this is far from being the last word. A lot of quantitative work needs to be done if theory is to meet practice. The global bounds provided in section 3.3 need to be complemented by a local analysis focused on the neighborhood of the critical points of C_emp. Ultimately this should provide rates of convergence for specific F and classes of target dependencies E(Y|x). Because practitioners often use stochastic versions of backpropagation, it will be interesting to see whether the approach advocated here can refine the results presented in An (1995). NI is a naive and rather conservative way of trying to enforce translation invariance while training neural networks. Other, and possibly more interesting, forms of invariance under transformation groups deserve to be examined in the NI perspective, as proposed by Leen (1995). Transformation groups can often be provided with a Riemannian manifold structure on which some heat kernels may be defined; thus it is appealing to check
whether the generalizations of the tools presented here (the theory of diffusion on manifolds; cf. Ledoux & Talagrand, 1991, for some results) can facilitate the analysis of NI versions of tangent-prop–like algorithms.

Appendix: From Heat Equation to Noise Injection

This appendix rederives the relation between the heat equation and NI using Fourier analysis techniques. To put the idea in the simplest setting, the problem is treated in one dimension. The heat equation under concern is:

∂u/∂t = (1/2) Δ_xx u,

u(0, x) = (1/ℓ) \sum_{i=1}^{ℓ} ( f(x_i) − y_i )^2 .

The Fourier transform of the heat equation is:

∂û(t, ξ)/∂t = −(1/2) ξ^2 û(t, ξ).   (A.1)
This is an ordinary differential equation whose solution is:

û(t, ξ) = K(ξ) exp( −ξ^2 t / 2 ),   (A.2)
where the integration constant K(ξ) is the Fourier transform of the initial condition; thus:

û(t, ξ) = û(0, ξ) exp( −ξ^2 t / 2 ).   (A.3)
The inverse Fourier transform maps the product into a convolution; this entails:

u(t, x) = u(0, ·) ∗ F^{−1}( exp( −ξ^2 t / 2 ) ).   (A.4)
Thus, we recover the definition of C_NI:

C_NI(f, t) = C_emp(f) ∗ N(t),   (A.5)
where N(t) is the density of a centered normal distribution with variance t.
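The identity in equation A.4 (and hence the heat-equation description of NI) is easy to verify numerically. The sketch below (ours; it uses SciPy's gaussian filter as a discrete heat kernel on a toy one-dimensional initial cost, with finite differences for both derivatives, so agreement is only up to discretization error) checks that the gaussian-smoothed cost approximately satisfies ∂u/∂t = (1/2) u_xx away from the grid boundary:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Toy one-dimensional cost playing the role of u(0, x).
xs = np.linspace(-3.0, 3.0, 2001)
dx = xs[1] - xs[0]
u0 = (1.0 / (1.0 + np.exp(-5.0 * xs)) - 0.5) ** 2

def heat(u, t):
    # Gaussian smoothing with variance t, i.e., equation A.4 on a grid.
    return gaussian_filter1d(u, sigma=np.sqrt(t) / dx, mode="nearest")

t, dt = 0.05, 1e-4
u_t = heat(u0, t)

# Left side of the heat equation: du/dt by a forward difference in t.
dudt = (heat(u0, t + dt) - u_t) / dt
# Right side: (1/2) d^2u/dx^2 by central differences in x.
lap = (u_t[2:] - 2.0 * u_t[1:-1] + u_t[:-2]) / dx ** 2

interior = slice(300, -300)  # keep away from boundary effects
err = np.max(np.abs(dudt[1:-1][interior] - 0.5 * lap[interior]))
print("max |du/dt - 0.5 u_xx| on the interior:", err)  # should be small
```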
Acknowledgments

We thank two anonymous referees for helpful comments on a draft of this article, particularly for suggesting that NI or derivative-based regularizers should be combined with global smoothers. Part of this work has been done while Stéphane Boucheron was visiting the Institute for Scientific Interchange in Torino. Boucheron has been partially supported by the German-French PROCOPE program 96052.

References

An, G. (1995). The effects of adding noise during backpropagation training on generalization performance. Neural Computation 8:643–674.
Auer, P., Herbster, M., & Warmuth, M. K. (1996). Exponentially many local minima for single neurons. Advances in Neural Information Processing Systems 8:316–322.
Barron, A., Birgé, L., & Massart, P. (1995). Risk bounds for model selection via penalization (Tech. Report). Université Paris-Sud. http://www.math.upsud.fr/stats/preprints.html.
Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1):108–116.
Comon, P. (1992). Classification supervisée par réseaux multicouches. Traitement du Signal 8(6):387–407.
Devroye, L. (1987). A Course in Density Estimation. Basel: Birkhäuser.
Ethier, S., & Kurtz, T. (1986). Markov Processes. New York: Wiley.
Grandvalet, Y. (1995). Effets de l'injection de bruit sur les perceptrons multicouches. Unpublished doctoral dissertation, Université de Technologie de Compiègne, Compiègne, France.
Grandvalet, Y., & Canu, S. (1995). A comment on noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics 25(4):678–681.
Grenander, U. (1981). Abstract Inference. New York: Wiley.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation 100:78–150.
Hirsch, M. (1976). Differential Topology. New York: Springer-Verlag.
Holmström, L., & Koistinen, P. (1992). Using additive noise in back-propagation training. IEEE Transactions on Neural Networks 3(1):24–38.
Karatzas, I., & Shreve, S. (1988). Brownian Motion and Stochastic Calculus (2nd ed.). New York: Springer-Verlag.
Ledoux, M., & Talagrand, M. (1991). Probability in Banach Spaces. Berlin: Springer-Verlag.
Leen, T. K. (1995). Data distributions and regularization. Neural Computation 7:974–981.
Matsuoka, K. (1992). Noise injection into inputs in backpropagation learning. IEEE Transactions on Systems, Man, and Cybernetics 22(3):436–440.
Plaut, D., Nowlan, S., & Hinton, G. (1986). Experiments on learning by back propagation (Tech. Rep. CMU-CS-86-126). Pittsburgh, PA: Carnegie Mellon University, Department of Computer Science.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical Recipes in C. Cambridge: Cambridge University Press.
Reed, R., Marks II, R., & Oh, S. (1995). Similarities of error regularization, sigmoid gain scaling, target smoothing and training with jitter. IEEE Transactions on Neural Networks 6(3):529–538.
Sietsma, J., & Dow, R. (1991). Creating artificial neural networks that generalize. Neural Networks 4(1):67–79.
Sontag, E. D. (1996). Critical points for least-squares problems involving certain analytic functions, with applications to sigmoidal neural networks. Advances in Computational Mathematics 5:245–268.
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solution of Ill-Posed Problems. Washington, DC: W. H. Wilson.
Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. New York: Springer-Verlag.
Webb, A. (1994). Functional approximation by feed-forward networks: A least-squares approach to generalization. IEEE Transactions on Neural Networks 5(3):363–371.
Received February 1, 1996; accepted September 4, 1996.
Communicated by Marwan Jabri
The Faulty Behavior of Feedforward Neural Networks with Hard-Limiting Activation Function Zhiyu Tian Department of Automation, Tsinghua University, Beijing 100084, People’s Republic of China
Ting-Ting Y. Lin Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A.
Shiyuan Yang
Shibai Tong
Department of Automation, Tsinghua University, Beijing 100084, People's Republic of China
With the progress in hardware implementation of artificial neural networks, the ability to analyze their faulty behavior has become increasingly important to their diagnosis, repair, reconfiguration, and reliable application. The behavior of feedforward neural networks with hard-limiting activation function under stuck-at faults is studied in this article. It is shown that the stuck-at-M faults have a larger effect on the network's performance than the mixed stuck-at faults, which in turn have a larger effect than that of stuck-at-0 faults. Furthermore, the fault-tolerant ability of the network decreases with the increase of its size for the same percentage of faulty interconnections. The results of our analysis are validated by Monte-Carlo simulations.

1 Introduction

Artificial neural networks (ANN) are generally claimed to be intrinsically reliable and fault tolerant in the neural network literature (Hecht-Nielsen, 1988; Lisboa, 1992). With the advancing research on their hardware implementation, the analysis of ANNs' faulty behavior becomes increasingly important for their diagnosis, repair, and reconfiguration. Faulty characteristics of feedback ANNs were discussed in Nijhuis and Spaanenburg (1989); Belfore and Johnson (1991); Chung and Krile (1992); and Protzel, Palumbo, and Arras (1993). The effect of weight disturbances on the performance degradation of feedforward ANNs has been examined by Stevenson, Winter, and Widrow (1990); Choi and Choi (1992); and Piche (1992). However, none of the feedforward analyses adopted the stuck-at fault
model, which has been a popular model in the analysis of feedback network interconnections. The use of the stuck-at fault model in characterizing feedforward ANNs is the focus here.

In this article, the ANN modeled has L layers, with the first layer being the input layer, the Lth layer being the output layer, and the L − 2 layers in between being the hidden layers. Each node in the output layer and the hidden layers receives the outputs of all the nodes in the preceding layer as its inputs and produces an output as follows:

$$ y_j = \mathrm{Sgn}\left( \sum_{i=1}^{n_{\ell-1}} w_i x_i + \theta_j \right), \qquad j = 1, \ldots, n_\ell, \qquad (1.1) $$
where $n_\ell$ is the number of nodes in the ℓth layer (ℓ = 2, ..., L); $x_i$ (i = 1, ..., $n_{\ell-1}$) is the ith input of the node, which takes only the two values +1 and −1; $w_i$ (i = 1, ..., $n_{\ell-1}$) is the weight of the ith input connection, with values in the range [−M, +M]; and $\theta_j$ is the bias, with values in the same range as $w_i$. When $n_{\ell-1}$ is large, the effect of the bias can often be neglected. Three types of stuck-at faults are considered: stuck-at-M (or stuck-at-(−M)), stuck-at-0, and a mixture of s-a-M (s-a-(−M)) and s-a-0. Specifically, interconnection faults are represented by either M (−M) or 0 replacing the corresponding $w_i x_i$. Our work focuses on the general effect of faults at the output; the inputs and weights are random variables assumed to be independent and identically distributed (iid), with means of 0. These assumptions reflect the fact that such networks are used in a variety of applications, so assuming any specific set of values for the inputs and weights would not be appropriate. The network is evaluated using, as its performance measure, the probability that the output is correct when faults occur: the higher the probability, the more fault tolerant the network.

Section 2 presents the effect of interconnection faults on the performance of an individual neuron. Section 3 shows the effect of erroneous inputs on the performance of an individual neuron. Section 4 combines the effects of both erroneous inputs and faulty interconnections on the performance of an individual neuron. The behavior of faulty feedforward networks with a hard-limiting activation function is presented in section 5, and the performance of the network when s-a-M, s-a-(−M), and s-a-0 faults exist is analyzed in section 6. Sections 7 and 8 conclude the article.

2 The Effect of Interconnection Faults on the Performance of an Individual Neuron

We begin with the fault-free case, where the output of a neuron, neglecting the bias θ, is:

$$ y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0, \\ -1 & \text{if } \sum_{i=1}^{n} w_i x_i < 0, \end{cases} \qquad (2.1) $$
where n is the number of inputs. In the case of $n_f$ faults, let F be the set of faulty interconnections and S be the set of normal ones; the output becomes:

$$ y' = \begin{cases} 1 & \text{if } \sum_{i \in S} w_i x_i + n_f M \ge 0, \\ -1 & \text{if } \sum_{i \in S} w_i x_i + n_f M < 0. \end{cases} \qquad (2.2) $$

The output error Δy, defined as the difference of the respective outputs, is:

$$ \Delta y = y' - y = \begin{cases} 2 & \text{if } \sum_{i \in S} w_i x_i + n_f M \ge 0 \text{ and } \sum_{i=1}^{n} w_i x_i < 0, \\ -2 & \text{if } \sum_{i \in S} w_i x_i + n_f M < 0 \text{ and } \sum_{i=1}^{n} w_i x_i \ge 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.3) $$
Let event A be $\sum_{i=1}^{n} w_i x_i < 0$ and event B be $\sum_{i \in S} w_i x_i + n_f M < 0$. Since $w_i x_i \le M$ for every $i \in F$, we have $\sum_{i=1}^{n} w_i x_i < \sum_{i \in S} w_i x_i + n_f M$, which means that B implies A (i.e., B ⊂ A). We therefore have

$$ \Pr\{AB\} = \Pr\{B\}, \qquad (2.4) $$

$$ \Pr\{\Delta y = 2\} = \Pr\{A\bar{B}\} = \Pr\{A\} - \Pr\{AB\} = \Pr\{A\} - \Pr\{B\}, \qquad (2.5) $$

$$ \Pr\{\Delta y = -2\} = \Pr\{\bar{A}B\} = \Pr\{B\} - \Pr\{AB\} = 0, \qquad (2.6) $$

where Pr{•} is the occurrence probability of the event •. Plugging in event B, we obtain

$$ \Pr\{B\} = \Pr\left\{ \sum_{i \in S} w_i x_i < -n_f M \right\}. \qquad (2.7) $$
We assume that $w_i$ and $x_i$ (i = 1, ..., n) are iid random variables; the mean and the variance of $w_i$ are 0 and $\sigma_w^2$, respectively. (When the means of the weights and inputs are not zero, the static offset does not affect the result of our analysis.) Since $x_i$ is either +1 or −1, $E[x_i^2] = 1$ (i = 1, ..., n). Furthermore, the mean of $w_i x_i$ is 0, and its variance $\sigma_{wx}^2$ is:

$$ \sigma_{wx}^2 = E[(w_i x_i)^2] - \{E[w_i x_i]\}^2 = E[w_i^2]\,E[x_i^2] = \sigma_w^2 + (E[w_i])^2 = \sigma_w^2. \qquad (2.8) $$

To solve for Pr{B}, we use the central limit theorem, which implies that when $n - n_f$ is sufficiently large, $\left(\sum_{i \in S} w_i x_i\right) / \left(\sqrt{n - n_f}\,\sigma_{wx}\right)$ approximates the normal
distribution N(0, 1). Equation 2.7 can now be written as:

$$ \Pr\{B\} = \Pr\left\{ \frac{\sum_{i \in S} w_i x_i}{\sqrt{n - n_f}\,\sigma_w} < -\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w} \right\} \approx \int_{-\infty}^{-\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.9) $$

Equation 2.9 is now plugged into equations 2.5 and 2.6 to get the probability of the output error:

$$ \Pr\{\Delta y \ne 0\} = \Pr\{\Delta y = 2\} = \frac{1}{2} - \int_{-\infty}^{-\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du = \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.10) $$
Therefore, the probability that a neuron will produce a correct output, $P_c$, is:

$$ P_c = \Pr\{\Delta y = 0\} = 1 - \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.11) $$
Let $P = n_f / n$; equation 2.11 becomes

$$ P_c = 1 - \int_{0}^{\frac{P \sqrt{n}\, M}{\sqrt{1 - P}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du. \qquad (2.12) $$

It can be observed that $P_c$ is a monotonically decreasing function of P. We performed Monte Carlo simulations, 10,000 runs for each point, with $w_i$ (i = 1, ..., n) uniformly distributed in the range [−1, 1] and n = 50, 100, and 200, on a Compaq 486 computer, and contrasted the results with those obtained from equation 2.12 in Figure 1. The comparison shows that equation 2.12 is an accurate representation of the s-a-M faulty behavior of a neuron.
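The closed form is easy to check numerically. The sketch below is ours (not the authors' simulator) and uses the stated assumptions: inputs $x_i = \pm 1$ with equal probability, weights uniform on [−1, 1] (so $\sigma_w = 1/\sqrt{3}$), and M = 1.

```python
# A minimal sketch (not the authors' original simulator) of equation 2.12
# and a Monte Carlo check under the stated assumptions.
import math
import numpy as np

def pc_stuck_at_M(P, n, M=1.0, sigma_w=math.sqrt(1.0 / 3.0)):
    """Equation 2.12: Pc = 1 - integral_0^a phi(u) du with
    a = P * sqrt(n) * M / (sqrt(1 - P) * sigma_w)."""
    a = P * math.sqrt(n) * M / (math.sqrt(1.0 - P) * sigma_w)
    return 1.0 - 0.5 * math.erf(a / math.sqrt(2.0))

def pc_stuck_at_M_mc(P, n, M=1.0, runs=10_000, seed=0):
    """Empirical Pc: force n_f = P*n of the products w_i x_i to +M."""
    rng = np.random.default_rng(seed)
    n_f = int(round(P * n))
    correct = 0
    for _ in range(runs):
        wx = rng.uniform(-1.0, 1.0, n) * rng.choice([-1.0, 1.0], n)
        y = 1.0 if wx.sum() >= 0 else -1.0                          # eq 2.1
        y_faulty = 1.0 if wx[n_f:].sum() + n_f * M >= 0 else -1.0   # eq 2.2
        correct += (y == y_faulty)
    return correct / runs

for P in (0.02, 0.05, 0.10):
    print(f"P={P:.2f}  theory={pc_stuck_at_M(P, 100):.3f}"
          f"  simulation={pc_stuck_at_M_mc(P, 100):.3f}")
```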
We now consider the effects of s-a-0 interconnection faults. When such faults occur, the output error is:

$$ \Delta y = \begin{cases} 2 & \text{if } \sum_{i=1}^{n} w_i x_i < 0 \text{ and } \sum_{i \in S} w_i x_i \ge 0, \\ -2 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0 \text{ and } \sum_{i \in S} w_i x_i < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (2.13) $$
Figure 1: Performance of individual neurons under s-a-M faults.
Let

$$ s = \frac{\sum_{i \in S} w_i x_i}{\sqrt{n - n_f}\,\sigma_w}, \qquad t = \frac{\sum_{i \in F} w_i x_i}{\sqrt{n_f}\,\sigma_w}; $$

then

$$ \Pr\{\Delta y = 2\} = \Pr\{\Delta y = -2\} = \Pr\left\{ \sqrt{n - n_f}\,\sigma_w s + \sqrt{n_f}\,\sigma_w t < 0,\ \sqrt{n - n_f}\,\sigma_w s \ge 0 \right\}. \qquad (2.14) $$

According to the central limit theorem, if both $n - n_f$ and $n_f$ are sufficiently large (in this context, four or more can be considered sufficiently large), both s and t approximate the normal distribution N(0, 1). Therefore equation 2.14 can be reduced to:

$$ \Pr\{\Delta y = 2\} \approx \frac{1}{2\pi} \iint_{D} e^{-\frac{s^2 + t^2}{2}}\, ds\, dt, \qquad (2.15) $$

where D is the region satisfying $\sqrt{n - n_f}\,\sigma_w s + \sqrt{n_f}\,\sigma_w t < 0$ and $\sqrt{n - n_f}\,\sigma_w s \ge 0$. Evaluating equation 2.15 in polar coordinates, we get:

$$ \Pr\{\Delta y = 2\} = \frac{1}{2\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}, \qquad (2.16) $$

where tan⁻¹ stands for the arc tangent.
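The wedge argument behind equation 2.16 is also easy to validate empirically; the sketch below is ours, not part of the paper.

```python
# Sketch (ours): equation 2.16 versus a Monte Carlo estimate of the
# probability that n_f stuck-at-0 faults flip a -1 output to +1.
import math
import numpy as np

def pr_flip_sa0(n_f, n):
    """Equation 2.16: Pr{dy = 2} = arctan(sqrt(n_f / (n - n_f))) / (2 pi)."""
    return math.atan(math.sqrt(n_f / (n - n_f))) / (2.0 * math.pi)

def pr_flip_sa0_mc(n_f, n, runs=200_000, seed=1):
    rng = np.random.default_rng(seed)
    wx = rng.uniform(-1.0, 1.0, (runs, n)) * rng.choice([-1.0, 1.0], (runs, n))
    full = wx.sum(axis=1)                 # fault-free pre-activation
    surviving = wx[:, n_f:].sum(axis=1)   # s-a-0 zeroes out n_f products
    return np.mean((full < 0) & (surviving >= 0))

print(pr_flip_sa0(10, 100), pr_flip_sa0_mc(10, 100))  # both near 0.051
```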
Figure 2: Performance of individual neurons under s-a-0 faults.
Hence, the probability that a neuron will produce a correct output is

$$ P_c = 1 - \Pr\{\Delta y = 2\} - \Pr\{\Delta y = -2\} = 1 - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}} = 1 - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{P}{1 - P}}, \qquad (2.17) $$

with $P = n_f / n$.
This is plotted against the Monte Carlo simulation in Figure 2. Although $P_c$ is a monotonically decreasing function of P, it is never smaller than 0.5. This is because a neuron with a hard-limiting activation function has only two possible outputs, +1 and −1. Even if all the input interconnections are faulty, the output still assumes one of the two values; thus, there is a 50 percent chance that the erroneous output coincides with the correct output.

3 The Effect of Erroneous Inputs on the Performance of an Individual Neuron

To study the effect of erroneous inputs on the performance of an individual neuron, we assume that there are $n_e$ erroneous inputs among all the n inputs of a neuron. Let R be the set of correct inputs and E be the set of erroneous inputs. The output error of the neuron under erroneous inputs
is:

$$ \Delta y = \begin{cases} 2 & \text{if } \sum_{i=1}^{n} w_i x_i < 0 \text{ and } \sum_{i \in R} w_i x_i - \sum_{i \in E} w_i x_i \ge 0, \\ -2 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0 \text{ and } \sum_{i \in R} w_i x_i - \sum_{i \in E} w_i x_i < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.1) $$

Let

$$ s = \frac{\sum_{i \in R} w_i x_i}{\sqrt{n - n_e}\,\sigma_w}, \qquad t = \frac{\sum_{i \in E} w_i x_i}{\sqrt{n_e}\,\sigma_w}; $$
when $n - n_e$ and $n_e$ are sufficiently large, both s and t approximate the normal distribution N(0, 1), and the probability of an incorrect output is

$$ \Pr\{\Delta y = 2\} = \Pr\{\Delta y = -2\} = \Pr\left\{ \sqrt{n_e}\, t \le \sqrt{n - n_e}\, s < -\sqrt{n_e}\, t \right\} = \frac{1}{2\pi} \iint_{D} e^{-\frac{s^2 + t^2}{2}}\, ds\, dt, \qquad (3.2) $$

where D is the region satisfying $\sqrt{n_e}\, t \le \sqrt{n - n_e}\, s < -\sqrt{n_e}\, t$. Through double integration we get:

$$ \Pr\{\Delta y = 2\} = \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (3.3) $$

Hence,

$$ P_c = 1 - \Pr\{\Delta y = 2\} - \Pr\{\Delta y = -2\} = 1 - \frac{2}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} = 1 - \frac{2}{\pi} \tan^{-1} \sqrt{\frac{P}{1 - P}}, \qquad (3.4) $$
with $P = n_e / n$. Figure 3 compares equation 3.4 with the Monte Carlo simulation, with $w_i$ (i = 1, ..., n) uniformly distributed in the range [−1, 1] and 10,000 runs for each point. It is observed that $P_c$ is a monotonically decreasing function of P.

4 The Combined Effect of Both Erroneous Inputs and Faulty Interconnections on the Performance of an Individual Neuron
Figure 3: Performance of individual neurons under input errors.

We now consider the case in which there are $n_e$ erroneous inputs (the set of correct inputs is R and the set of incorrect inputs is E) and $n_f$ interconnection faults (the set of normal connections is S, while the set of faulty connections is F) among the n inputs to a neuron. This situation can be considered a combination of two sequential events: $n_e$ erroneous inputs appear, and then $n_f$ of the input connections suffer stuck-at faults. Based on equation 3.1, let

Event A be: $\sum_{i=1}^{n} w_i x_i < 0$.
Event B be: $\sum_{i \in R} w_i x_i - \sum_{i \in E} w_i x_i < 0$. $\qquad (4.1)$
Event C be: when the $n_f$ stuck-at faults occur after the erroneous inputs appear, the output $y' < 0$.

The output error caused by both faulty inputs and interconnections is:

$$ \Delta y = \begin{cases} 2 & \text{on } A\bar{C}, \\ -2 & \text{on } \bar{A}C, \\ 0 & \text{otherwise.} \end{cases} \qquad (4.2) $$

Then

$$ \Pr\{\Delta y = 2\} = \Pr\{A\bar{C}\} = \Pr\{AB\bar{C}\} + \Pr\{A\bar{B}\bar{C}\} = \Pr\{AB\}\Pr\{\bar{C}/AB\} + \Pr\{A\bar{B}\}\Pr\{\bar{C}/A\bar{B}\}. \qquad (4.3) $$

Event C occurs primarily due to B or B̄, and its relationship with A is negligible; thus we use the following approximations:

$$ \Pr\{\bar{C}/AB\} \approx \Pr\{\bar{C}/B\} = \frac{\Pr\{B\bar{C}\}}{\Pr\{B\}} = 2\Pr\{B\bar{C}\} \quad \text{(since } \Pr\{B\} = 1/2 \text{ by symmetry)}, \qquad (4.4) $$

$$ \Pr\{\bar{C}/A\bar{B}\} \approx \Pr\{\bar{C}/\bar{B}\} = 2\Pr\{\bar{B}\bar{C}\}. \qquad (4.5) $$
From equations 2.5 and 3.3 we have:

$$ \Pr\{AB\} = \frac{1}{2} - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}, \qquad (4.6) $$

$$ \Pr\{A\bar{B}\} = \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (4.7) $$

If the input connection faults are of type s-a-M, then according to equations 2.10 and 2.6,

$$ \Pr\{B\bar{C}\} = \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du, \qquad (4.8) $$

$$ \Pr\{\bar{B}\bar{C}\} = \Pr\{\bar{B}\} - \Pr\{\bar{B}C\} = \Pr\{\bar{B}\} = \frac{1}{2}. \qquad (4.9) $$
Substituting equations 4.4 through 4.9 into 4.3, we derive the following equation:

$$ \Pr\{\Delta y = 2\} = 2\left( \frac{1}{2} - \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \right) \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du + \frac{1}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (4.10) $$

Similarly,

$$ \Pr\{\Delta y = -2\} = \Pr\{\bar{A}C\} = \Pr\{\bar{A}B\}\Pr\{C/B\} + \Pr\{\bar{A}\bar{B}\}\Pr\{C/\bar{B}\} = \frac{2}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \left( \frac{1}{2} - \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du \right). \qquad (4.11) $$

Hence

$$ \Pr\{\Delta y \ne 0\} = \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du - \frac{4}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \int_{0}^{\frac{n_f M}{\sqrt{n - n_f}\,\sigma_w}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\, du + \frac{2}{\pi} \tan^{-1} \sqrt{\frac{n_e}{n - n_e}}. \qquad (4.12) $$
If the interconnection faults are s-a-0 faults, equations 4.8 and 4.9 become

$$ \Pr\{B\bar{C}\} = \frac{1}{2\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}, \qquad (4.13) $$

$$ \Pr\{\bar{B}\bar{C}\} = \Pr\{\bar{B}\} - \Pr\{\bar{B}C\} = \frac{1}{2} - \frac{1}{2\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}. \qquad (4.14) $$

Following a method similar to that used in deriving equation 4.12, we get:

$$ \Pr\{\Delta y \ne 0\} = \frac{1}{\pi} \left( \tan^{-1} \sqrt{\frac{n_f}{n - n_f}} + 2 \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} - \frac{4}{\pi} \tan^{-1} \sqrt{\frac{n_f}{n - n_f}}\, \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \right). \qquad (4.15) $$
5 The Faulty Behavior of Feedforward Networks with Hard-Limiting Activation Function

We now study the faulty behavior of feedforward networks with hard-limiting activation function. If a fraction $P_f$ of the input connections to all the nodes in the hidden layers and the output layer is faulty, then for every node in the ℓth layer (ℓ = 2, ..., L):

$$ n = n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.1) $$

$$ n_f = P_f \cdot n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.2) $$

$$ \Pr\{n_e = k\} = \binom{n_{\ell-1}}{k} P_{E_{\ell-1}}^{k} (1 - P_{E_{\ell-1}})^{n_{\ell-1} - k}, \qquad \ell = 2, \ldots, L, \qquad (5.3) $$
where $P_{E_\ell}$ is the probability that the output of a node in the ℓth layer is incorrect. For the first hidden layer, $n_e = 0$; for the layers that follow, $n_e \ne 0$, and we must use equation 4.12 or 4.15 to compute Pr{Δy ≠ 0}. When we examine equations 4.12 through 4.15, we find that $\alpha_1 = \tan^{-1}\sqrt{n_e / (n - n_e)}$ is present in both equations. Since $n_e$ is a random variable taking the value k with the probability expressed in equation 5.3, $\alpha_1$ is also a random variable. Averaging over $n_e = 0, 1, \ldots, n$, the mean of $\alpha_1$ is

$$ E[\alpha_1] = \sum_{k=0}^{n} \tan^{-1} \sqrt{\frac{k}{n - k}} \cdot \Pr\{n_e = k\}. \qquad (5.4) $$
Figure 4: Faulty performance of the network when every node has the same fraction of faulty interconnections.
If we substitute $\alpha_1$ with its mean in equation 4.12 or 4.15, $P_{E_\ell} = \Pr\{\Delta y \ne 0\}$ can be derived. This procedure can be repeated until ℓ = L to derive $P_{E_L}$ for the output layer. However, if $n_{\ell-1}$ is large, the calculation of $\alpha_1$ can be tedious for each $n_e = k$ (k = 0, 1, ..., $n_{\ell-1}$). We can then approximate this calculation by substituting the expected value of $n_e$,

$$ \bar{n}_e = P_{E_{\ell-1}}\, n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.5) $$

for $n_e$. Based on this approximation, we obtained the performance curves of $P_c$ and compared them to the simulation results of two types of three-layered networks (100, 100, 1 and 200, 200, 1) with s-a-M faults in Figure 4. Two thousand simulations are performed for each point in this and later figures. The error is within an acceptable range.

In practice, only the percentage of faulty interconnections is known a priori, while their distribution among all network interconnections is arbitrary. In that case, we use the expected value of $n_f$,

$$ \bar{n}_f = P_f \cdot n_{\ell-1}, \qquad \ell = 2, \ldots, L, \qquad (5.6) $$
for $n_f$ in equation 4.12. Both the theoretical and the simulation results of $P_c$ for three types of three-layered networks with s-a-M faults are shown in Figure 5.
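The layer-by-layer procedure is compact in code. The sketch below is ours; it reuses pr_err_combined from the previous listing and applies the mean-value approximations of equations 5.5 and 5.6.

```python
# Sketch (ours) of the section 5 procedure: propagate the per-node error
# probability PE_l through the layers, replacing n_e and n_f by their
# expected values (equations 5.5 and 5.6).
def network_pc(layer_sizes, P_f, fault="saM", M=1.0):
    """layer_sizes lists nodes per layer, input layer first, e.g. (100, 100, 1).
    Returns Pc = 1 - PE_L, the probability of a correct network output."""
    PE = 0.0                   # the first hidden layer sees no input errors
    for n_prev in layer_sizes[:-1]:
        n_f = P_f * n_prev     # expected number of faulty connections (5.6)
        n_e = PE * n_prev      # expected number of erroneous inputs (5.5)
        PE = pr_err_combined(n_prev, n_e, n_f, fault, M)   # eq 4.12 or 4.15
    return 1.0 - PE

for sizes in ((100, 100, 1), (200, 200, 1)):
    print(sizes, network_pc(sizes, P_f=0.05, fault="saM"))
```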
Figure 5: Performance of three-layered networks under s-a-M faults.
Furthermore, we generated the performance curves for the three-layered and five-layered networks under s-a-0 faults and compared them to the simulation results in Figure 6. Although the error is larger than that in Figure 4, it gradually decreases as the network size increases. In addition, both the theoretical and the simulation results follow the same curvature.

6 The Network Performance When All Three Types of Stuck-At Faults Exist

In the previous sections, we analyzed network performance in the presence of only one type of stuck-at fault. In reality, more than one type of fault may exist simultaneously. We study the case of multiple faults given that a fault is an s-a-M or s-a-(−M) fault, each with probability q (0 < q ≤ 1/2), or an s-a-0 fault with probability 1 − 2q. Thus $w_i x_i$ on a faulty interconnection is replaced by a random variable $z_i$ (i ∈ F), which takes the values M, −M, and 0 with probabilities q, q, and 1 − 2q, respectively. It follows that $E[z_i] = 0$ and $\mathrm{Var}[z_i] = 2M^2 q$. The output of an individual neuron with $n_f$ faulty interconnections is:
$$ y' = \begin{cases} 1 & \text{if } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i \ge 0, \\ -1 & \text{if } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i < 0. \end{cases} \qquad (6.1) $$
Figure 6: Performance of networks under s-a-0 faults.
The output error is:

$$ \Delta y = \begin{cases} 2 & \text{if } \sum_{i=1}^{n} w_i x_i < 0 \text{ and } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i \ge 0, \\ -2 & \text{if } \sum_{i=1}^{n} w_i x_i \ge 0 \text{ and } \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i < 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (6.2) $$
Therefore,

$$ \Pr\{\Delta y = 2\} = \Pr\left\{ \sum_{i=1}^{n} w_i x_i < 0,\ \sum_{i \in S} w_i x_i + \sum_{i \in F} z_i \ge 0 \right\}. \qquad (6.3) $$
Let

$$ s = \frac{\sum_{i \in S} w_i x_i}{\sqrt{n - n_f}\,\sigma_w}, \qquad t = \frac{\sum_{i \in F} w_i x_i}{\sqrt{n_f}\,\sigma_w}, \qquad u = \frac{\sum_{i \in F} z_i}{\sqrt{2 n_f q}\, M}; $$

according to the central limit theorem, when $n - n_f$ and $n_f$ are sufficiently large, s, t, and u all follow the normal distribution N(0, 1). Therefore:

$$ \Pr\{\Delta y = 2\} = \Pr\left\{ \sqrt{n - n_f}\,\sigma_w s + \sqrt{2 n_f q}\, M u \ge 0,\ \sqrt{n - n_f}\,\sigma_w s + \sqrt{n_f}\,\sigma_w t < 0 \right\} = \iiint_{\Omega} \frac{1}{(2\pi)^{3/2}}\, e^{-\frac{1}{2}(s^2 + t^2 + u^2)}\, ds\, dt\, du, \qquad (6.4) $$

where Ω is the three-dimensional region satisfying $\sqrt{n - n_f}\,\sigma_w s + \sqrt{2 n_f q}\, M u \ge 0$ and $\sqrt{n - n_f}\, s + \sqrt{n_f}\, t < 0$. Imagine this region inside a huge sphere centered at the origin with radius R → ∞. Because the two boundary planes intersect at the origin, the region Ω can be regarded as a lune whose angle is the dihedral angle between the two planes, which is:

$$ \alpha = \tan^{-1} \sqrt{\frac{n_f}{n - n_f} + \frac{2 n\, n_f\, q M^2}{(n - n_f)^2 \sigma_w^2}} = \tan^{-1} \sqrt{\frac{P_f}{1 - P_f} + \frac{2 P_f\, q M^2}{(1 - P_f)^2 \sigma_w^2}}. \qquad (6.5) $$

We can rotate the coordinates so that the u axis coincides with the intersection line of the two planes. Employing cylindrical coordinates, we then have:

$$ \Pr\{\Delta y = 2\} = \int_{-\infty}^{\infty} \int_{0}^{\alpha} \int_{0}^{\infty} \frac{1}{(2\pi)^{3/2}}\, e^{-\frac{u^2 + r^2}{2}}\, r\, dr\, d\theta\, du = \frac{\alpha}{2\pi}. \qquad (6.6) $$

Hence the probability of an incorrect output is

$$ \Pr\{\Delta y \ne 0\} = \Pr\{\Delta y = 2\} + \Pr\{\Delta y = -2\} = \frac{\alpha}{\pi}. \qquad (6.7) $$
Figure 7 shows the comparison between the theoretical and the simulation results for a neuron's performance under such mixed faults. The simulation follows equation 6.7 closely.

We now proceed to study the combined effect of erroneous inputs and the mixed stuck-at faults on the performance of a neuron. In the case of mixed faults, equations 4.8 and 4.9 become:

$$ \Pr\{B\bar{C}\} = \frac{\alpha}{2\pi}, \qquad (6.8) $$

$$ \Pr\{\bar{B}\bar{C}\} = \frac{1}{2} - \frac{\alpha}{2\pi}. \qquad (6.9) $$
Figure 7: Performance of individual neurons under mixed faults.
Therefore, the probability of an incorrect output is:

$$ \Pr\{\Delta y \ne 0\} = 2\Pr\{\Delta y = 2\} = \frac{1}{\pi} \left( \alpha + 2 \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} - \frac{4}{\pi}\, \alpha \tan^{-1} \sqrt{\frac{n_e}{n - n_e}} \right). \qquad (6.10) $$

Following the steps described in section 5, we can calculate the faulty performance of the network. Figure 8 shows the performance curves for a three-layered network (100, 100, 1), a four-layered network (100, 100, 100, 1), and a five-layered network (100, 100, 100, 100, 1), with q = 1/3. From Figure 7, we see that the performance of an individual neuron degrades as q increases. This confirms that s-a-M (s-a-(−M)) faults have a larger effect on the network's performance than s-a-0 faults.

7 Discussion

It has been believed that the fault-tolerance ability of ANNs improves with the growth of network size (Nijhuis & Spaanenburg, 1989). We have shown the opposite for feedforward ANNs with hard-limiting activation functions when the fraction (instead of the number) of faulty interconnections is considered. This was demonstrated initially in Figure 5 for networks with the same number of layers: the more hidden nodes there are, the faster the performance degrades with an increased percentage of s-a-M interconnections.
Figure 8: Performance of ANNs under mixed stuck-at faults.
Furthermore, Figure 9 shows the performance curves for networks with different numbers of layers but the same number of nodes in every layer: tolerance for s-a-M faults weakens as layers are added. When comparing two networks with the same total number of hidden nodes, as shown in Figure 10, the network with fewer layers (three layers: 100, 100, 1) is more fault tolerant than the one with more layers (100, 50, 50, 1). The same conclusion can be drawn for s-a-0 and mixed faults, as shown in Figures 6 and 8. This suggests that, provided other performance requirements are satisfied, ANN designers should limit the number of layers and the number of nodes per layer to obtain better fault tolerance in feedforward neural networks with hard-limiting activation functions.

8 Conclusion

In this article, we have analyzed the behavior of feedforward neural networks with hard-limiting activation function under stuck-at faults. The relationship between the network performance and the number of faulty interconnections, together with the method for analyzing the faulty behavior, is presented. Through comparison with Monte Carlo simulation results, the correctness of our formulation for individual neurons is verified. Furthermore, our results demonstrate that increasing the size of a feedforward neural network will lower its ability to tolerate stuck-at faults when
Figure 9: s-a-M faulty performance of networks with different numbers of layers.

Figure 10: Comparison between the faulty performance of networks with the same total number of hidden nodes but different numbers of hidden layers.
the fraction of faults is considered and that s-a-M (s-a-(−M)) faults have a larger effect on the network than s-a-0 faults. These conclusions are extremely useful in the design of feedforward networks.
References

Belfore, L. A., & Johnson, B. W. (1991). The analysis of the faulty behavior of synchronous neural networks. IEEE Trans. on Computers, 40, 1424–1429.

Choi, J. Y., & Choi, C.-H. (1992). Sensitivity analysis of multilayer perceptron with differentiable activation functions. IEEE Trans. on Neural Networks, 3, 101–107.

Chung, P.-C., & Krile, T. F. (1992). Characteristics of Hebbian-type associative memories having faulty interconnections. IEEE Trans. on Neural Networks, 3, 969–980.

Hecht-Nielsen, R. (1988). Neurocomputing: Picking the human brain. IEEE Spectrum, 25, 36–41.

Lisboa, P. G. J. (1992). Neural networks. London: Chapman Hall.

Nijhuis, J. A. G., & Spaanenburg, L. (1989). Fault tolerance of neural associative memories. IEEE Proceedings, 136 (pt. E), 389–394.

Piche, S. (1992). Robustness of feedforward neural networks. In Proceedings of IJCNN (pp. 346–351). Baltimore, MD.

Protzel, P. W., Palumbo, D. L., & Arras, M. K. (1993). Performance and fault-tolerance of neural networks for optimization. IEEE Trans. on Neural Networks, 4, 600–614.

Stevenson, M., Winter, R., & Widrow, B. (1990). Sensitivity of feedforward neural networks to weight errors. IEEE Trans. on Neural Networks, 1, 71–80.
Received February 24, 1994; accepted September 4, 1996.
Communicated by Bill Horne
Analysis of Dynamical Recognizers

Alan D. Blair, Jordan B. Pollack
Department of Computer Science, Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254-9110, USA
Pollack (1991) demonstrated that second-order recurrent neural networks can act as dynamical recognizers for formal languages when trained on positive and negative examples, and observed both phase transitions in learning and iterated function system–like fractal state sets. Follow-on work focused mainly on the extraction and minimization of a finite state automaton (FSA) from the trained network. However, such networks are capable of inducing languages that are not regular and therefore not equivalent to any FSA. Indeed, it may be simpler for a small network to fit its training data by inducing such a nonregular language. But when is the network's language not regular? In this article, using a low-dimensional network capable of learning all the Tomita data sets, we present an empirical method for testing whether the language induced by the network is regular. We also provide a detailed ε-machine analysis of trained networks for both regular and nonregular languages.

Neural Computation 9, 1127–1142 (1997)

1 Introduction

Pollack (1991) showed one way a recurrent network may be trained to recognize formal languages from examples. The resulting networks often displayed complex limit dynamics, which were fractal in nature (Kolen, 1993). Alternative architectures had been employed earlier for related tasks (Jordan, 1986; Pollack, 1987; Cleeremans, Servan-Schreiber, & McClelland, 1989). Others have been proposed since (Watrous & Kuhn, 1992; Frasconi, Gori, & Soda, 1995; Zeng, Goodman, & Smyth, 1994; Das & Mozer, 1994; Forcada & Carrasco, 1995), and a number of approaches to analyzing recurrent networks have been developed. One of the principal themes has been the use of clustering and minimization techniques to extract a finite state automaton (FSA) that approximates the network's behavior (Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Sajda, 1995). Casey (1996) showed that if the network robustly models an FSA, the method proposed in Giles et al. (1992) will successfully extract the FSA given a fine enough resolution. In many cases, however, the language induced by the network is not regular, and therefore it cannot be modeled exactly by any FSA. In order to analyze the behavior of these networks, we need a
Figure 1: System architecture.
method for testing empirically whether the network's behavior is regular (i.e., whether it is modeling an FSA). In this article, we show how to perform an analysis of the trained network at high resolution by constructing a series of finite state approximations that describe its behavior at progressively more refined levels of detail. With this analysis we are able to measure empirically when the network has crossed the boundary between regular and nonregular behavior.

2 Architecture

Our system comprises a gated recurrent neural network with two subnetworks W₀ and W₁, a gating device, and a perceptron P (see Figure 1). For σ = 0, 1 we denote by $W_\sigma^{jk}$ the real-valued weights of subnetwork $W_\sigma$ and by $w_\sigma$ the function it implements. To operate the network, we first feed the initial vector x₀ into the input layer. Then, as each symbol of a binary string Σ = σ₁, ..., σₙ is read in, the gating device chooses one of the two subnetworks W₀ or W₁, depending on whether the next symbol σₜ is equal to 0 or 1. The current real-valued
hidden unit state vector $x_{t-1}$ is then fed through the chosen subnetwork $W_{\sigma_t}$ to produce the next state $x_t = w_{\sigma_t}(x_{t-1})$, given by

$$ x_t^j = \tanh\left( W_{\sigma_t}^{j0} + \sum_{k=1}^{d} W_{\sigma_t}^{jk}\, x_{t-1}^k \right), \qquad 1 \le j \le d,\ 1 \le t \le n. $$

This part of the architecture is equivalent to the second-order recurrent networks used in Pollack (1991) and Giles et al. (1992), with a slightly different notation. After the whole string has been read, the final state $x_n$ is fed through a perceptron P, producing output

$$ z = \tanh\left( P_0 + \sum_{j=1}^{d} P_j\, x_n^j \right), $$
where each $P_j$ is a real-valued weight and z is the real-valued output. To make the geometry more accessible, we use the interval [−1, 1] for network activations and adopt the hyperbolic tangent as our transfer function, which is equivalent by rescaling to the usual sigmoid function. The object of training is that the output z should be close to +1 for accept strings and close to −1 for reject strings ("close" meaning within, say, 0.4). The class of languages recognizable by such networks lies somewhere between the regular and recursive language families and is known to contain some languages that are not context free (Siegelmann & Sontag, 1992). In the work reported here the state-space dimension d was always equal to 2, which gives the model a total of 17 free parameters: 3 for the perceptron, 6 for each subnetwork, and 2 for the initial point. As outlined below, this architecture was adequate to the task of learning any of the data sets from Tomita (1982) within a few hundred training epochs.
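A minimal sketch of this recognizer in NumPy follows; the class name and the random initialization are ours, and training (backpropagation through time) is omitted. For d = 2 it has exactly the 17 free parameters counted above.

```python
# Sketch (ours) of the gated second-order recognizer: two subnetworks
# W_0 and W_1, a perceptron P, and a trainable initial point x_0.
import numpy as np

class DynamicalRecognizer:
    def __init__(self, d=2, seed=0):
        rng = np.random.default_rng(seed)
        # Column 0 of W[sigma] holds the biases W^{j0}; the rest are W^{jk}.
        self.W = [rng.normal(0.0, 0.5, (d, d + 1)) for _ in (0, 1)]
        self.P = rng.normal(0.0, 0.5, d + 1)   # P[0] is the perceptron bias
        self.x0 = rng.uniform(-1.0, 1.0, d)    # initial state, also trained

    def state(self, string):
        x = self.x0
        for ch in string:                      # gate on each symbol in turn
            W = self.W[int(ch)]
            x = np.tanh(W[:, 0] + W[:, 1:] @ x)
        return x

    def output(self, string):
        x = self.state(string)                 # z close to +1 means accept,
        return np.tanh(self.P[0] + self.P[1:] @ x)  # close to -1 means reject

net = DynamicalRecognizer()
print(net.output("1011"))
```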
3 Analysis

Our purpose here is not only to train the network but also to gain some insight into how it accomplishes its assigned task (see Casey, 1996, for another, related approach to this problem). If the recurrent layer has d nodes, the state space of the system is the d-dimensional hypercube X = [−1, 1]^d. However, we need not be concerned with the whole state space, but only with that portion of it, which we call A₀, that is accessible from the initial point x₀ via repeated applications of the maps w₀ and w₁. Formally, we may define A₀ thus:

1. Y₀ = {x₀}.
2. For i ≥ 0, $Y_{i+1} = Y_i \cup w_0(Y_i) \cup w_1(Y_i)$ (note that $Y_{i+1} \supseteq Y_i$). (3.1)
3. $A_0 = \lim_{i \to \infty} Y_i$.
Once we have identified the space A₀ on which the dynamics take place, the next step is to find an appropriate subdivision of this space (a finite collection of disjoint nonempty subsets covering A₀) that is compatible with the maps w₀ and w₁. The coarsest possible subdivision is provided by the perceptron, which divides A₀ into the subset $U_{acc}$ of "accept" points and the subset $U_{rej}$ of "reject" points. So we take as our first subdivision $\mathcal{S}_1 = \{U_{acc}, U_{rej}\}$. If $\mathcal{S} = \{U_1, \ldots, U_m\}$ is a subdivision, the preimage of $\mathcal{S}$ under w₀ is

$$ w_0^{-1}(\mathcal{S}) = \{ w_0^{-1} U_1, \ldots, w_0^{-1} U_m \} \setminus \{\emptyset\}, $$

where $w_0^{-1} U_j = \{ x \in A_0 \mid w_0(x) \in U_j \}$ (and similarly for $w_1^{-1}(\mathcal{S})$). The join $\mathcal{R} \vee \mathcal{S}$ of two partitions $\mathcal{R}$ and $\mathcal{S}$ is

$$ \mathcal{R} \vee \mathcal{S} = \{ U \cap V \mid U \in \mathcal{R},\ V \in \mathcal{S} \} \setminus \{\emptyset\}. $$

Using these tools, we define a series of subdivisions $\{\mathcal{S}_i\}_{i \ge 1}$ by

1. $\mathcal{S}_1 = \{U_{acc}, U_{rej}\}$.
2. For i ≥ 1, $\mathcal{S}_{i+1} = \mathcal{S}_i \vee w_0^{-1}(\mathcal{S}_i) \vee w_1^{-1}(\mathcal{S}_i)$.
We are now ready to construct a series of nondeterministic FSAs $\{M_i\}_{i \ge 1}$ that approximate the behavior of the network. For each i, let $M_i$ be as follows:

1. The states of $M_i$ are indexed by the elements $U_j$ of the subdivision $\mathcal{S}_i$.
2. The initial state is the $U_j$ that contains x₀.
3. The accept states are those $U_j$ contained in $U_{acc}$.
4. An edge labeled 0 connects $U_j$ to $U_k$ exactly when $U_j \cap w_0^{-1} U_k \ne \emptyset$.
5. An edge labeled 1 connects $U_j$ to $U_k$ exactly when $U_j \cap w_1^{-1} U_k \ne \emptyset$.

Each FSA in the series is a refinement of the previous one, which sketches out some of the details of its nondeterminism.

Proposition. If $M_i$ is deterministic for some i, then $M_j$ will equal $M_i$ for all j > i, and $M_i$ will exactly model the network's behavior.

Proof. Suppose $M_i$ is deterministic for some i and consider the input string Σ = σ₁, ..., σₙ, with σⱼ ∈ {0, 1} for 1 ≤ j ≤ n. We must show that Σ is accepted by $M_i$ if and only if it is accepted by the network. Recall that x₀ denotes the initial point, and let $x_j = w_{\sigma_j}(x_{j-1})$ for 1 ≤ j ≤ n. Recall that the states of $M_i$ are indexed by the elements $U_k$ of subdivision $\mathcal{S}_i$, and for 0 ≤ j ≤ n let $U^{(j)}$ be the subset $U_k$ that contains $x_j$. Then in particular $U^{(0)}$
contains x₀ and is therefore the initial state of $M_i$. Since $x_j = w_{\sigma_j}(x_{j-1})$, we have

$$ U^{(j-1)} \cap w_{\sigma_j}^{-1} U^{(j)} \ne \emptyset, $$

so $U^{(j-1)}$ is connected to $U^{(j)}$ by an edge labeled $\sigma_j$. As $M_i$ is deterministic, this must be the only such edge emanating from $U^{(j-1)}$. By induction on j, it follows that $M_i$ must be put into state $U^{(j)}$ when the string σ₁, ..., σⱼ is input. In particular, once the whole string has been input, $M_i$ will be in state $U^{(n)}$, so

Σ is accepted by $M_i$ ⇔ $U^{(n)} \subseteq U_{acc}$ ⇔ $x_n \in U_{acc}$ ⇔ Σ is accepted by the network.

To show that $M_{i+1} = M_i$ and no further subdivisions are made, suppose to the contrary that there are two points x₁ and x₂ that lie in the same element $U_j$ of the partition $\mathcal{S}_i$, but in different elements of $\mathcal{S}_{i+1} = \mathcal{S}_i \vee w_0^{-1}(\mathcal{S}_i) \vee w_1^{-1}(\mathcal{S}_i)$. Then it must be that for σ = 0 or 1, $w_\sigma(x_1)$ lies in some $U_k \in \mathcal{S}_i$ while $w_\sigma(x_2)$ lies in $U_l$ with $U_k \ne U_l$. But then $U_j$ would be connected to both $U_k$ and $U_l$ by edges labeled σ, contradicting the assumption that $M_i$ was deterministic.

If the induced language is not regular, the machines $M_i$ will in theory continue to grow indefinitely in complexity. Of course, in practice we cannot implement the above procedure exactly for our trained networks but must instead use a discrete approximation. Following the analysis of Giles et al. (1992), we can approximate A₀ computationally along the lines of equation 3.1 as follows. First, "discretize" the system at resolution r by dividing the state space into $r^d$ points in a regular lattice and approximating w₀ and w₁ with functions ŵ₀ and ŵ₁ that act on these points such that ŵ_σ(x̂) is the nearest lattice point to w_σ(x̂). Each $Y_i$ will then be represented by a finite set $\hat{Y}_i$, and the condition $\hat{Y}_{i+1} \supseteq \hat{Y}_i$ guarantees that the procedure will terminate and produce a discrete approximation Â₀ of A₀.

In discrete form, the above procedure may be equated with the Hopcroft minimization algorithm (Hopcroft & Ullman, 1979), or the method of ε-machines (Crutchfield & Young, 1990; Crutchfield, 1994), and was first used in the present context by Giles et al. (1992). Using small values of r, their aim was to extract an FSA that might not model the network's behavior exactly but would model it closely enough to classify the training data faithfully. We had in mind a different purpose: to test empirically whether the network is regular (modeling an FSA) and to analyze its behavior in more fine-grained detail. The minimization procedure described above is bound to terminate in finite time due to the finite resolution of the representation.
Table 1: Epochs to Learning (L) and Regularity (R) for the Seven Tomita Languages.

Network               N1 (L/R)  N2 (L)  N3 (L)  N4 (L)  N4 (R)  N5 (L/R)  N6 (L)  N6 (R)  N7 (L)
Epochs to learning    200       600     400     200     —       800       200     —       600
Epochs to regularity  200       —       —       —       400     800       —       600     —
200 × 200 FSA size    2         341     272     2052    4       21        7123    3       264
500 × 500 FSA size    2         881     808     7248    4       7         39200   3       806
Tomita's FSA size     2         3       5       —       4       4         —       3       5
However, as the resolution is scaled up, the size of the largest FSA generated should in principle stabilize in the case of regular networks and grow rapidly in the case of nonregular ones. Therefore, as an empirical test of regularity, we perform these computations at two different resolutions (200 × 200 and 500 × 500) and report (in rows 4 and 5 of Table 1) the size (number of states) of the largest FSA generated. If the largest FSAs generated at the different resolutions are the same, we take this as an indication that the network is regular; if they are vastly different, that it is not regular. This allows us to probe the nature of the network's behavior as the training progresses. In some cases the network undergoes a phase transition to regularity at some point in its training, and we are able to measure with considerable precision the time at which this phase transition occurs. It should be stressed, however, that these discretized computations do not constitute a formal proof but merely provide strong empirical evidence as to the regularity or otherwise of the network and its induced language. More rigorous results may be found in Casey (1996, 1993), who described the general dynamical properties that a neural network or any dynamical system has to have in order to model an FSA robustly. Tiňo, Horne, Giles, and Collingwood (1995) also made a detailed analysis of networks with a two-dimensional state space under certain assumptions about the sign and magnitude of the weights.

It is important also to note that the regularity we are testing for is a property of the trained network and not an intrinsic property of the input strings, since the (necessarily finite) training data may be generalized in an infinite number of ways, each producing an equally valid induced language that may be regular or nonregular. Good symbolic algorithms already exist for finding a regular language compatible with given input data (Trakhtenbrot & Barzdin', 1973; Lang, 1992). Our purpose is rather to analyze the kinds of languages that a dynamical system such as a neural network is likely to come up with when trained on those data. We do not claim that our methods are efficient when scaled up to higher dimensions. Rather, we hope that a detailed analysis of networks in low dimensions will lead to a better understanding of the phenomenon of regular versus nonregular network behavior in general.
Finally, we remark that the functions w₀ and w₁ map X continuously into itself and as such define an iterated function system (IFS) (Barnsley, 1988), as noted in Kolen (1994). The attractor A of this IFS may be defined as follows:

1. Z₀ = X.
2. For i ≥ 0, $Z_{i+1} = w_0(Z_i) \cup w_1(Z_i)$ (by induction, $Z_{i+1} \subseteq Z_i$).
3. $A = \bigcap_{i \ge 0} Z_i$.

As with A₀, we can find a discrete approximation Â of A at any given resolution. We have found empirically that the convergence to the attractor was very rapid for our trained networks, so that Â₀ was equal to a subset of Â plus a very small number of "transient" points. For this reason we call A₀ a subattractor of the IFS. If w₀ and w₁ were contractive maps, A₀ would contain the whole of A (Barnsley, 1988), but in our case they are generally not contractive, so Â₀ may contain only part of Â. The close relationship between Â₀ and Â should make it possible to analyze the general family of languages accepted by the network as x₀ varies, though we do not pursue this line of inquiry here.

4 Results

Networks with the architecture described in section 2 were trained to recognize formal languages using backpropagation through time (Williams & Zipser, 1989; Rumelhart, Hinton, & Williams, 1986), with a modification similar to Quickprop (Fahlman, 1989)¹ and a learning rate of 0.03. The weights were updated in batch mode, and the perceptron weights $P_j$ were constrained by rescaling to the region where $\sum_{j=1}^{d} P_j^2 \le 1$. In contrast to Pollack (1991), where the backpropagation was truncated, we backpropagated through all levels of recurrence, as in Williams and Zipser (1989). In addition, we allowed the initial point x₀ to vary as part of the backpropagation, a modification that was also developed in independent work by Forcada and Carrasco (1995). Our seven groups of training strings (see the appendix) were copied exactly from Tomita (1982), except that we did not include the empty string in our training sets. Rows 4 and 5 of Table 1 show the size (number of states) of the largest FSA generated from the trained networks at two different resolutions
¹ Specifically, the cost function we used was

$$ E = -\frac{1}{2}(1+s)^2 \log\left( \frac{1+z}{1+s} \right) - \frac{1}{2}(1-s)^2 \log\left( \frac{1-z}{1-s} \right) + s(z - s) $$

(where z is the actual output and s the desired output), which leads to a delta rule of δ = (1 − sz)(s − z).
Figure 2: Subattractor, weights, and equivalent FSA for network N1.
by the methods described in section 3. For comparison, row 6 shows the size of the minimal FSAs that Tomita found by exhaustive search. A network can be said to have learned all its training data once the output is positive for all accept strings and negative for all reject strings (a max error of 1.0), but it makes sense to aim for a lower maximum error in order to provide a safety margin. For network N5, the maximum error reached its lowest value of 0.46 after 800 epochs. The other networks were trained until they achieved a maximum error of less than 0.4, the number of epochs required being shown in row 2 (epochs to learning). Since the test for regularity is computationally intensive, we did not test our networks at every epoch but only at intervals of 200 epochs. For network N1, the derived FSA is the same for both resolutions, providing evidence that the induced language is regular. For networks N2, N3, N4, N6, and N7 on the other hand, the maximum FSA grows dramatically as we increase the resolution, suggesting that the induced language is not regular. We continued to train these networks to see if they would later become regular and found that networks N4 and N6 became regular after 400 and 600 epochs, respectively, as indicated in row 3 (epochs to regularity), while networks N2, N3, and N7 remained nonregular even after 10,000 epochs. For N5, the size of the maximum FSA actually decreased with higher resolution from 21 to 7, suggesting that the induced language is regular but that high resolution is required to detect its regularity. As an illustration, Figure 2 shows the subattractor for N1, which learned its training data after 200 epochs. The axes measure the activations of hidden nodes x1 and x2 , respectively. The initial point x0 is indicated by a cross (upper-right corner). Also shown is the dividing line between accept points and reject points. Roughly speaking, w0 squashes the entire state space into
Figure 3: The phase transition of N4.
the lower-left corner, while w1 flattens it out onto the line x1 = x2 . The dividing line separates A0 into two pieces: Uacc in the upper-right corner and Urej in the lower left. w1 maps each piece back into itself, while w0 maps both pieces into Urej . The ε-machine method thus produces the two-state FSA shown at the right. (Final states are indicated by a double circle, the initial state by an open arrow.) In this simple case the FSA can be shown to model the network’s behavior exactly. Of course, N1 was trained on an extremely easy task that could have been learned with even fewer weights. Figure 3 shows network N4 captured at its phase transition. After 371 epochs (left) it has learned the training set, but the induced language is not regular. At the next epoch (right), it has become regular. Although the weights have been shifted by only 0.004 units in euclidean space, the dynamics of the two networks are quite different. Applied to the left network at 500 × 500 resolution, our analysis produced a series of FSAs of maximum size 2564. Applied to the right network, it terminated after three steps to produce the same four-state FSA that Tomita (1982) found by exhaustive search. The states of the FSA correspond to the following four subsets of A0 : U1 in the upper-left corner, U2 around (−0.6, −0.9), U3 barely visible at (0.4, −1.0), and U4 in the lower-right corner. The details of the successive subdivisions are outlined in Figure 4.
Figure 4: The analysis of N4.
Figure 5: N2, which is not regular, and N5, which is regular but with no obvious clusters.
The above examples would also be amenable to previous, clustering-based approaches because of the way they "partition [their] state space into fairly well-separated, distinct regions or clusters," as hypothesized by Giles et al. (1992). Those shown in Figure 5 seem to be trickier. The subattractor for N5 (right) appears to be a bunch of scattered points with no obvious clustering, yet our fine-grained analysis was able to extract from it a seven-node FSA, a little larger than the minimal FSA of four nodes Tomita found. Network N2 (left) seems to have induced a nonregular language. Figure 6 shows the first four iterations of analysis applied to it. Note that each FSA refines the previous one, bringing to light more details. Much can be learned about the induced language by examining these finite state approximations. For example, we can infer from M₂ that the network rejects all strings ending in 1, and from M₃ that it accepts all nonempty strings of the form (10)* but rejects any string ending in 110.
Figure 6: The first four steps of analysis applied to N2.
5 Conclusion

By allowing the decision boundary and the initial point to vary, our networks with two hidden nodes were able to induce languages from all the data sets of Tomita (1982) within a few hundred epochs.

Many researchers implicitly regard an extracted FSA as superior to the trained network from which it was extracted (Omlin & Giles, 1996) with regard to predictability, compactness of description, and the particular way each generalizes to classify new, unseen input strings. For this reason, earlier work in the field focused on extracting an FSA that approximates the
behavior of the network. However, that approach is imprecise if the network has induced a nonregular language and does not exactly model an FSA. We have provided a fine-grained analysis for a number of trained networks, both regular and nonregular, using an approach similar to the method of ε-machines that Crutchfield and Young (1990) used to analyze certain handcrafted dynamical systems. In particular, we were able to measure empirically whether the induced language was regular. The fact that several of the networks induced nonregular languages suggests a discrepancy between languages that are "simple" for dynamical recognizers and those that are "simple" from the point of view of automata theory (the regular languages). It is easier for these learning systems to fit the sparse data of the Tomita training sets by inducing a nonregular language than by inducing the expected minimal regular language. The use of comprehensive training sets, or intentional heuristics, might constrain networks away from these interesting dynamics.

It could be argued that the network and FSA ought to be seen on a more equal footing, since the 17 parameters of the network provide a compactness of description comparable to that of the FSA, and the language induced by the network is in principle on a par with that of the FSA in the sense that both generalize the same training data. We hope that further work in this direction may lead to a better understanding of network dynamics and help to clarify, compare, and contrast the relative merits of symbolic and dynamical systems.
Appendix: Tomita's Data Sets

N1 Accept: 1, 11, 111, 1111, 11111, 111111, 1111111, 11111111
N1 Reject: 0, 10, 01, 00, 011, 110, 11111110, 10111111

N2 Accept: 10, 1010, 101010, 10101010, 10101010101010
N2 Reject: 1, 0, 11, 00, 01, 101, 100, 1001010, 10110, 110101010

N3 Accept: 1, 0, 01, 11, 00, 100, 110, 111, 000, 100100, 110000011100001, 111101100010011100
N3 Reject: 10, 101, 010, 1010, 1110, 1011, 10001, 111010, 1001000, 11111000, 0111001101, 11011100110

N4 Accept: 1, 0, 10, 01, 00, 100100, 001111110100, 0100100100, 11100, 0010
N4 Reject: 000, 11000, 0001, 000000000, 11111000011, 1101010000010111, 1010010001, 0000, 00000

N5 Accept: 11, 00, 1001, 0101, 1010, 1000111101, 1001100001111010, 111111, 0000
N5 Reject: 0, 111, 010, 000000000, 1000, 01, 10, 1110010100, 010111111110, 0001, 011

N6 Accept: 10, 01, 1100, 101010, 111, 000000, 10111, 0111101111, 100100100
N6 Reject: 1, 0, 11, 00, 101, 011, 11001, 1111, 00000000, 010111, 101111011111, 1001001001

N7 Accept: 1, 0, 10, 01, 11111, 000, 00110011, 0101, 0000100001111, 00100, 011111011111, 00
N7 Reject: 1010, 00110011000, 0101010101, 1011010, 10101, 010100, 101001, 100100110101
Acknowledgments

This research was funded by a Krasnow Foundation Postdoctoral Fellowship, by ONR grant N00014-95-0173, and by NSF grant IRI-95-29298. We are indebted to David Wittenberg for helping to improve its presentation and to Mike Casey for stimulating discussions.

References

Barnsley, M. F. (1988). Fractals everywhere. San Diego, CA: Academic Press.

Casey, M. (1993). Computation dynamics in discrete-time recurrent neural networks. Proceedings of the Annual Research Symposium of UCSD Institute for Neural Computation, 78–95.

Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6).

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.

Crutchfield, J. P. (1994). The calculi of emergence: Computation, dynamics and induction. Physica D, 75, 11–54.

Crutchfield, J. P., & Young, K. (1990). Computation at the onset of chaos. In W. H. Zurek (Ed.), Complexity, entropy and the physics of information. Reading, MA: Addison-Wesley.

Das, S., & Mozer, M. C. (1994). A unified gradient-descent/clustering architecture for finite state machine induction. Neural Information Processing Systems, 6, 19–26.
Fahlman, S. E. (1989). Fast-learning variations on back-propagation: An empirical study. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38–51). San Mateo, CA: Morgan Kaufmann.

Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930.

Frasconi, P., Gori, M., & Soda, G. (1995). Recurrent neural networks and prior knowledge for sequence processing: A constrained nondeterministic approach. Knowledge Based Systems, 8(6), 313–332.

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.

Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.

Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of the Eighth Conference of the Cognitive Science Society, Amherst, MA, 531–546.

Kolen, J. F. (1993). Fool's gold: Extracting finite state machines from recurrent network dynamics. Neural Information Processing Systems, 6, 501–508.

Kolen, J. F. (1994). Exploring the computational capabilities of recurrent neural networks. Unpublished doctoral dissertation, Ohio State University.

Lang, K. J. (1992). Random DFA's can be approximately learned from sparse uniform examples. Proc. Fifth ACM Workshop on Computational Learning Theory, 45–52.

Manolios, P., & Fanelli, R. (1994). First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6), 1155–1173.

Omlin, C. W., & Giles, C. L. (1996). Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1), 41.

Pollack, J. B. (1987). Cascaded back propagation on dynamic connectionist networks. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, 391–404.

Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

Siegelmann, H. T., & Sontag, E. D. (1992). On the computational power of neural networks. Proceedings of the Fifth ACM Workshop on Computational Learning Theory, Pittsburgh, PA.

Tiňo, P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1995). Finite state machines and recurrent neural networks: Automata and dynamical systems approaches (Tech. Rep. No. UMIACS-TR-95-1). College Park, MD: Institute for Advanced Computer Studies, University of Maryland.

Tiňo, P., & Sajda, J. (1995). Learning and extracting initial mealy automata with a modular neural network model. Neural Computation, 7(4), 822.

Tomita, M. (1982). Dynamic construction of finite-state automata from examples using hill-climbing. Proceedings of the Fourth Annual Cognitive Science Conference, Ann Arbor, MI, 105–108.
Trakhtenbrot, B. A., & Barzdin', Ya. M. (1973). Finite automata: Behavior and synthesis. Amsterdam: North-Holland.

Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite state languages using second-order recurrent networks. Neural Computation, 4(3), 406–414.

Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270.

Zeng, Z., Goodman, R. M., & Smyth, P. (1994). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976–990.
Received September 19, 1995; accepted July 30, 1996.
Communicated by David Haussler
A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split

Michael Kearns
AT&T Research Labs, Murray Hill, NJ 07974, U.S.A.
We give a theoretical and experimental analysis of the generalization error of cross validation using two natural measures of the problem under consideration. The approximation rate measures the accuracy to which the target function can be ideally approximated as a function of the number of parameters, and thus captures the complexity of the target function with respect to the hypothesis model. The estimation rate measures the deviation between the training and generalization errors as a function of the number of parameters, and thus captures the extent to which the hypothesis model suffers from overfitting. Using these two measures, we give a rigorous and general bound on the error of the simplest form of cross validation. The bound clearly shows the dangers of making γ, the fraction of data saved for testing, too large or too small. By optimizing the bound with respect to γ, we then argue that the following qualitative properties of cross-validation behavior should be quite robust to significant changes in the underlying model selection problem:

• When the target function complexity is small compared to the sample size, the performance of cross validation is relatively insensitive to the choice of γ.

• The importance of choosing γ optimally increases, and the optimal value for γ decreases, as the target function becomes more complex relative to the sample size.

• There is nevertheless a single fixed value for γ that works nearly optimally for a wide range of target function complexity.

1 Introduction

In this article, we analyze the performance of cross validation in the context of model selection and complexity regularization.¹ We work in a setting

¹ Perhaps in conflict with accepted usage in statistics, where the method is often referred to as keeping a "hold-out" set, here we use the term cross validation to mean the simple method of saving out an independent test set to perform model selection. Precise definitions will be stated shortly.
Neural Computation 9, 1143–1161 (1997)
in which we must choose the right number of parameters for a hypothesis function in response to a finite training sample, with the goal of minimizing the resulting generalization error. There is a large and interesting literature on cross validation methods, which often emphasizes asymptotic statistical properties, or the exact calculation of the generalization error for simple models. (The literature is too large to survey here; foundational papers include Stone, 1974, 1977.) Our approach here is somewhat different and is primarily inspired by two sources. The first is the work of Barron and Cover (1991), who introduced the idea of bounding the error of a model selection method (in their case, the minimum description length principle) in terms of a quantity known as the index of resolvability. The second is the work of Vapnik (1982), who provided extremely powerful and general tools for uniformly bounding the deviation between the training and generalization error. (For related methods and problems, we refer the reader to Devroye (1988).) We combine these methods to give a new and general analysis of cross-validation performance.

In the first and more formal part of the article, we give a rigorous bound on the error of cross validation in terms of two parameters of the underlying model selection problem: the approximation rate and the estimation rate. Taken together, these two problem parameters determine our analog of the index of resolvability, and the work of Vapnik and others yields estimation rates applicable to many natural problems. In the second and more experimental part of the article, we investigate the implications of our bound for choosing γ, the fraction of data withheld for testing in cross validation. The most interesting aspect of this analysis is the identification of several qualitative properties of the optimal γ that appear to be invariant over a wide class of model selection problems.

2 The Formalism

We consider model selection as a two-part problem: choosing the appropriate number of parameters for the hypothesis function and tuning these parameters. The training sample is used in both steps of this process. In many settings, the tuning of the parameters is determined by a fixed learning algorithm such as backpropagation, and then model selection reduces to the problem of choosing the architecture (which determines the number of weights, the connectivity pattern, and so on). For concreteness, we adopt an idealized version of this division of labor. We assume a nested sequence of function classes H₁ ⊂ ··· ⊂ H_d ···, called the structure (Vapnik, 1982), where H_d is a class of boolean functions of d parameters, each function being a mapping from some input space X into {0, 1}. For simplicity, in this article we assume that the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971; Vapnik, 1982) of the class H_d is O(d). To remove this assumption, one simply replaces all occurrences of d in our bounds by the VC dimension of H_d. We assume that we have in our possession a learning
algorithm L that on input any training sample S and any value d will output a hypothesis function $h_d \in H_d$ that minimizes the training error over $H_d$; that is, $\epsilon_t(h_d) = \min_{h \in H_d} \{\epsilon_t(h)\}$, where $\epsilon_t(h)$ is the fraction of the examples in S on which h disagrees with the given label. In many situations, training error minimization is known to be computationally intractable, leading researchers to investigate heuristics such as backpropagation. The extent to which the theory presented here applies to such heuristics will depend in part on the extent to which they approximate training error minimization for the problem under consideration.

Model selection is thus the problem of choosing the best value of d. More precisely, we assume an arbitrary target function f (which may or may not reside in one of the function classes in the structure H₁ ⊂ ··· ⊂ H_d ···) and an input distribution P; f and P together define the generalization error function $\epsilon_g(h) = \Pr_{x \in P}[h(x) \ne f(x)]$. We are given a training sample S of f, consisting of m random examples drawn according to P and labeled by f (with the labels possibly corrupted by a noise process that randomly complements each label independently with probability η < 1/2). In many model selection methods, such as Rissanen's minimum description length (MDL) principle (Rissanen, 1989) and Vapnik's guaranteed risk minimization (Vapnik, 1982), for each value of d = 1, 2, 3, ... we give the entire sample S and d to the learning algorithm L to obtain the function $h_d$ minimizing the training error in $H_d$. A "penalty" for complexity (a function of d and m) is then added to the training errors $\epsilon_t(h_d)$ to choose among the sequence h₁, h₂, h₃, .... Whatever the method, we interpret the goal to be that of minimizing the generalization error of the hypothesis selected.

In this article, we will make the rather mild but very useful assumption that the structure has the property that for any sample size m, there is a value $d_{max}(m)$ such that $\epsilon_t(h_{d_{max}(m)}) = 0$ for any labeled sample S of m examples.² We call the function $d_{max}(m)$ the fitting number of the structure. The fitting number formalizes the simple notion that with enough parameters, we can always fit the training data perfectly, a property held by most sufficiently powerful function classes (including multilayer neural networks). We typically expect the fitting number to be a linear function of m, or at worst a polynomial in m. The significance of the fitting number for us is that no reasonable model selection method should choose $h_d$ for $d \ge d_{max}(m)$, since doing so simply adds complexity without reducing the training error.

In this article we concentrate on the simplest version of cross validation. Unlike the methods mentioned above, which use the entire sample for training the $h_d$, in cross validation we choose a parameter γ ∈ [0, 1], which determines the split between training and test data. Given the input sample S of m examples, let S′ be the subsample consisting of the first (1 − γ)m examples in S, and S″ the subsample consisting of the last γm examples. In
2
Weaker definitions for dmax (m) also suffice, but this is the simplest.
1146
Michael Kearns
cross validation, rather than giving the entire sample S to L, we give only the smaller sample S0 , resulting in the sequence h1 , . . . , hdmax ((1−γ )m) of increasingly complex hypotheses. Each hypothesis is now obtained by training on only (1 − γ )m examples, which implies that we will only consider values of d smaller than the corresponding fitting number dmax ((1 − γ )m); let us γ introduce the shorthand dmax for dmax ((1 − γ )m). Cross validation chooses the hd satisfying hd = mini∈{1,...,dγmax } {²t00 (hi )} where ²t00 (hi ) is the error of hi on the subsample S00 . Notice that we are not considering multifold cross validation, or other variants that make more efficient use of the sample, because our analyses will require the independence of the test set. However, we believe that many of the themes that emerge here may apply to these more sophisticated variants as well. We use ²cv (m) to denote the generalization error ²g (hd ) of the hypothesis hd chosen by cross validation when given as input a sample S of m random examples of the target function. Obviously, ²cv (m) depends on S, the structure, f , P, and the noise rate. When bounding ²cv (m), we will use the expression with high probability to mean with probability 1 − δ over the sample S, for some small fixed constant δ > 0. All of our results can also be stated with δ as a parameter at the cost of a log(1/δ) factor in the bounds, or in terms of the expected value of ²cv (m). 3 The Approximation Rate It is apparent that any nontrivial bound on ²cv (m) must take account of some measure of the complexity of the unknown target function f . The correct measure of this complexity is less obvious. Following the example of Barron and Cover’s (1991) analysis of MDL performance in the context of density estimation, we propose the approximation rate as a natural measure of the complexity of f and P in relation to the chosen structure H1 ⊂ · · · ⊂ Hd · · ·. Thus we define the approximation rate function ²g (d) to be ²g (d) = minh∈Hd {²g (h)}. The function ²g (d) tells us the best generalization error that can be achieved in the class Hd , and it is a nonincreasing function of d. If ²g (s) = 0 for some sufficiently large s, this means that the target function f , at least with respect to the input distribution, is realizable in the class Hs , and thus s is a coarse measure of how complex f is. More generally, even if ²g (d) > 0 for all d, the rate of decay of ²g (d) still gives a nice indication of how much representational power we gain with respect to f and P by increasing the complexity of our models. Still missing, of course, is some means of determining the extent to which this representational power can be realized by training on a finite sample of a given size, but this will be added shortly. First we give examples of the approximation rate that we will examine in some detail following the general bound on ²cv (m). 3.1 The Intervals Problem. In this problem, the input space X is the real interval [0, 1], and the class Hd of the structure consists of all boolean step
Bound on the Error of Cross Validation
1147
Three Approximation Rates 1
0.8
0.6
y
0.4
0.2
0
50
100
150 d
200
250
300
Figure 1: Plots of three approximation rates. For the intervals problem with target complexity s = 250 intervals (linear plot intersecting d-axis at 250), for the Perceptron problem with target complexity s = 150 nonzero weights (nonlinear plot intersecting d-axis at 150), and for power law decay asymptoting at ²min = 0.05.
functions over [0, 1] of at most d steps; thus, each function partitions the interval [0, 1] into at most d disjoint segments (not necessarily of equal width) and assigns alternating positive and negative labels to these segments. The input space is one-dimensional, but the structure contains arbitrarily complex functions over [0, 1]. It is easily verified that our assumption that the VC dimension of Hd is O(d) holds here, and that the fitting number obeys dmax (m) ≤ m. Now suppose that the input density P is uniform, and suppose that the target function f is the function of s alternating segments of equal width 1/s, for some s (thus, f lies in the class Hs ). We will refer to these settings as the intervals problem. Then the approximation rate is ²g (d) = (1/2)(1 − d/s) for 1 ≤ d < s and ²g (d) = 0 for d ≥ s (see Figure 1). Thus, as long as d < s, increasing the complexity d gives linear payoff in terms of decreasing the optimal generalization error. For d ≥ s, there is no payoff for increasing d, since we can already realize the target function. The reader can easily verify that if f lies in Hs but does not have equal width intervals, ²g (d) is still piecewise linear, but for d < s is concave up: the gain in approximation obtained by incrementally increasing d diminishes as d becomes larger. Although the intervals problem is rather simple and artifi-
1148
Michael Kearns
cial, a precise analysis of cross-validation behavior can be given for it, and we will argue that this behavior is representative of much broader and more realistic settings. 3.2 The Perceptron Problem. In this problem, the input space X is 0 are parameters capturing the rate of decay to ²min (see Figure 1). Our analysis suggests that the qualitative phenomena we identify are invariant to wide ranges of choices for these parameters. Note that for all three cases, there is a natural measure of target function complexity captured by ²g (d). In the intervals problem it is the number of target intervals, in the Perceptron problem it is the number of nonzero weights in the target, and in the more general power law case it is captured by the parameters of the power law. In the later part of the article, we will study cross-validation performance as a function of these complexity measures, and obtain remarkably similar predictions.
3 Since the bounds we will give have straightforward generalizations to real-valued function learning under squared error, examining behavior for ²g (d) in this setting seems reasonable.
Bound on the Error of Cross Validation
1149
4 The Estimation Rate For a fixed f , P and H1 ⊂ · · · ⊂ Hd · · ·, we say that a function ρ(d, m) is an estimation rate bound if for all d and m, with high probability over the sample S we have |²t (hd ) − ²g (hd )| ≤ ρ(d, m), where as usual hd is the result of training error minimization on S within Hd . Thus ρ(d, m) simply bounds the deviation between the training error and the generalization error of hd .4 Note that the best such bound may depend in a complicated way on all of the elements of the problem: f , P, and the structure. Indeed, much of the recent work on the statistical physics theory of learning curves has documented the wide variety of behaviors that such deviations may assume (Seung et al., 1992; Haussler, Kearns, Seung, & Tishby, 1994). However, for many natural problems it is both convenient and accurate to rely on a universal estimation rate bound provided by the powerful theory of uniform convergence: Namely, for any f , P, and any structure, the function ¶ µq (d/m) log(m/d) (4.1) ρ(d, m) = O is an estimation rate bound (Vapnik, 1982).5 Depending on the details of the problem, it is sometimes appropriate to omit the log(m/d) factor and p often appropriate to refine the d/m behavior topa function that interpolates smoothly between d/m behavior for small ²t to d/m for large ²t . Although such refinements are both interesting and important, many of the qualitative claims and predictions we will make are invariant to them as long as the deviation |²t (hd ) − ²g (hd )| is well approximated by a power law (d/m)α (α > 0); it will be more important to recognize and model the cases in which power law behavior is grossly violated. Note that this universal estimation rate bound holds only under the assumption that the training sample is noise free, but straightforward generalizations exist. For instance, if the training datapare corrupted by random label noise at rate 0 ≤ η < 1/2, then ρ(d, m) = O( (d/(1 − 2η)2 m) log(m/d)) is again a universal estimation rate bound. It is important to realize that once we have an estimation rate bound, it is a straightforward matter to derive a natural approach to the problem of model selection. Namely, since we have ²g (hd ) ≤ ²t (hd ) + ρ(d, m)
(4.2)
with high probability, if we assume equality in equation 4.2, we might be led to choose the value of d minimizing the quantity ²t (hd ) + ρ(d, m). This is 4 In the later part of the article, we will often assume that specific forms of ρ(d, m) are not merely upper bounds on this deviation but accurate approximations to it. 5 The results of Vapnik actually show the stronger result that |² (h) − ² (h)| = t g
p
O( (d/m) log(m/d)) for all h ∈ Hd , not only for the training error minimizer hd .
1150
Michael Kearns
exactly the approach taken in Vapnik’s guaranteed risk minimization (Vapnik, 1982), and it leads to very general and powerful bounds on generalization error for the approach just described. Our motivation here, however, is slightly different; we are interested in providing similar bounds for cross validation, primarily because it is an approach in wide experimental use. Thus, rather than assuming an estimation rate bound and examining the model selection method suggested by the minimization of equation 4.2, we are instead examining what can be said about cross validation if we assume certain approximation and estimation rates. After giving a general bound on ²cv (m) in which the approximation and estimation rate functions are parameters, we investigate the behavior of ²cv (m) (and more specifically, of the parameter γ ) for specific choices of these parameters. 5 The Bound Theorem 1. Let H1 ⊂ · · · ⊂ Hd · · · be any structure, where the VC dimension of Hd is O(d). Let f and P be any target function and input distribution, let ²g (d) be the approximation rate function for the structure with respect to f and P, and let ρ(d, m) be an estimation rate bound for the structure with respect to f and P. Then for any m, with high probability
²cv (m) ≤
minγ
γ log(d ) max , (5.1) ²g (d) + ρ(d, (1 − γ )m) + O γm ª
©
1≤d≤dmax
s
γ
where γ is the fraction of the training sample used for testing, and dmax is the fitting number dmax ((1−γ )m). Using the universal estimation bound rate of equation 4.1 and the rather weak assumption that dmax (m) is polynomial in m, we obtain that with high probability Ãs
( ²cv (m) ≤
minγ
1≤d≤dmax
Ãs
+O
²g (d) + O
!)
³m´ d log (1 − γ )m d !
(5.2)
log((1 − γ )m) . γm
Straightforward generalizations of these bounds for the case where the data are corrupted by classification noise can be obtained using the modified estimation rate bound given in section 4.6 6 The main effect of classification noise at rate η is the replacement of occurrences in the bound of the sample size m by the smaller effective sample size (1 − η)2 m.
Bound on the Error of Cross Validation
1151
Proof Sketch. We have space only to highlight the main ideas of the proof. γ For each d from 1 to dmax , fix a function fd ∈ Hd satisfying ²g ( fd ) = ²g (d); thus, fd is the best possible approximation to the target f within the class it can be shown that with high Hd . By a standard Chernoff bound argument q γ
γ
probability we have |²t ( fd ) − ²g ( fd )| ≤ log(dmax )/m for all 1 ≤ d ≤ dmax . This means q that within each Hd , the minimum training error ²t (hd ) is at most ²g (d) +
γ
log(dmax )/m. Since ρ(d, m) is an estimation rate bound, we have γ
that with high probability, for all 1 ≤ d ≤ dmax , ²g (hd ) ≤ ²t (hd ) + ρ(d, m) q γ ≤ ²g (d) + log(dmax )/m + ρ(d, m).
(5.3) (5.4) γ
Thus we have bounded the generalization error of the dmax hypotheses hd , only one of which will be selected. If we knew the actual values of these generalization errors (equivalent to having an infinite test sample in cross γ validation), we could bound our error by the minimum over all 1 ≤ d ≤ dmax of the expression 5.4. However, in cross validation we do not know the exact values of these generalization errors but must instead use the γ m testing examples to estimate them. Again q by standard Chernoff bound arguments, γ
this introduces an additional log(dmax )/(γ m) error term, resulting in our final bound. This concludes the proof sketch. In the bounds given by equations 5.1 and 5.2, the min{·} expression is analogous to Barron and Cover’s (1991) index of resolvability. The final term in the bounds represents the error introduced by the testing phase of cross validation. These bounds exhibit trade-off behavior with respect to the parameter γ . As we let γ approach 0, we are devoting more of the sample to training the hd , and the estimation q rate term ρ(d, (1 − γ )m) is decreasing. γ
However, the test error term O( log(dmax )/(γ m)) is increasing, since we have fewer data to estimate the ²g (hd ) accurately. The reverse phenomenon occurs as we let γ approach 1. While we believe theorem 1 to be enlightening and potentially useful in its own right, we would now like to take its interpretation a step further. More precisely, suppose we assume that the bound is an approximation to the actual behavior of ²cv (m). Then in principle we can optimize the bound to obtain the best value for γ . Of course, in addition to the assumptions involved—the main one being that ρ(d, m) is a good approximation to the training generalization error deviations of the hd —this analysis can only be carried out given information that we should not expect to have in practice (at least in exact form)—in particular, the approximation rate function ²g (d), which depends on f and P. However, we argue in the coming sections that several interesting qualitative phenomena regarding the choice of γ are largely invariant to a wide range of natural behaviors for ²g (d).
1152
Michael Kearns
6 A Case Study: The Intervals Problem We begin by performing the suggested optimization of γ for the intervals problem. Recall that the approximation rate here is ²g (d) = (1/2)(1−d/s) for d < s and ²g (d) = 0 for d ≥ s, where s is the complexity of the target function. Here we analyze the behavior obtainedpby assuming that the estimation rate ρ(d, m) actually behaves as ρ(d, m) = d/(1 − γ )m (so we are omitting the log factor from the universal bound),7 and to simplify the formal analysis apbit (but without changing the qualitative behavior) we replace the term p log((1 − γ )m)/(γ m) by the weaker log(m)/m. Thus, if we define the function q p (6.1) F(d, m, γ ) = ²g (d) + d/(1 − γ )m + log(m)/(γ m) then following equation 5.1, we are approximating ²cv (m) by ²cv (m) ≈ min1≤d≤dγmax {F(d, m, γ )}.8 The first step of the analysis is to fix a value for γ and differentiate F(d, m, γ ) with respect to d to discover the minimizing value of d. This differentiation must have two regimes due to the discontinuitypat d = s in ²g (d). It is easily verified that the derivative is −(1/2s) + 1/(2 d(1 − γ )m) p for d < s and 1/(2 d(1 − γ )m) for d ≥ s. It can be shown that provided that (1 − γ )m ≥ 4s then d = s is a global minimum of F(d, m, γ ), and if this condition is violated, then the value we obtain for ²cv (m) is vacuously large anyway (meaning that this fixed choice of γ cannot end up being the optimal one, or, if m < 4s, that our analysis claims we simply do not have enough data for nontrivial generalization, regardless of how we split it between training and p p testing). Plugging in d = s yields ²cv (m) ≈ F(s, m, γ ) = s/(1 − γ )m + log(m)/(γ m) for this fixed choice of γ . Now by differentiating F(s, m, γ ) with respect to γ , it can be shown that the optimal choice of γ under the assumptions is γopt =
(log(m)/s)1/3 . 1 + (log(m)/s)1/3
(6.2)
It is important to remember at this point that despite the fact that we have derived a precise expression for γopt , due to the assumptions and approxima7 It can be argued that this power law estimation rate is actually a rather accurate approximation for the true behavior of the training generalization deviations of the hd for this particular problem at moderate classification noise levels. For very small noise rates, a better approximation would replace the two square root terms by linear terms. 8 Note that although there are hidden constants in the O(·) notation of theorem 1, it is the relative weights of the estimation and test error terms that are important when we optimize equation 5.1 for γ . In the definition of F(d, m, γ ) we have chosen both terms to have constant multipliers equal to 1, which is reasonable because both terms have the same Chernoff bound origins.
Bound on the Error of Cross Validation
1153
cv error bound, intervals, d = s = 1000 slice, m=10000 0.5
0.4
0.3
y
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 2: Plot of the predicted generalization error of cross validation for the intervals model selection problem, as a function of the fraction 1 − γ of data used for training. (In the plot, the fraction of training data is 0 on the left (γ = 1) and 1 on the right (γ = 0).) The fixed sample size m = 10, 000 was used, and the six plots show the error predicted by the theory for target function complexity values s = 10 (bottom plot), 50, 100, 250, 500, and 1000 (top plot).
tions we have made in the various constants, any quantitative interpretation of this expression is meaningless. However, we can reasonably expect that this expression captures the qualitative way in which the optimal γ changes as the amount of data m changes in relation to the target function complexity s. On this score, the situation initially appears rather bleak, as the function (log(m)/s)1/3 /(1 + (log(m)/s)1/3 ) is quite sensitive to the ratio log(m)/s. For example, for m =10,000, if s = 10 we obtain γopt = 0.524 · · ·, if s = 100 we obtain γopt = 0.338 · · ·, and if s = 1000 we obtain γopt = 0.191 · · ·. Thus γopt is becoming smaller as log(m)/s becomes small, and the analysis suggests vastly different choices for γopt depending on the target function complexity, which is something we do not expect to have the luxury of knowing in practice. However, it is both fortunate and interesting that γopt does not tell the entire story. In Figure 2, we plot the function F(s, m, γ )9 as a function of γ 9
In the plots, we now use the more accurate test penalty term since we are no longer concerned with simplifying the calculation.
p
log((1 − γ )m)/(γ m)
1154
Michael Kearns
for m = 10,000 and for several different values of s (note that for consistency with the later experimental plots, the x-axis of the plot is actually the training fraction 1 − γ ). Here we can observe four important qualitative phenomena, which we list in order of increasing subtlety: 1. When s is small compared to m, the predicted error is relatively insensitive to the choice of γ . As a function of γ , F(s, m, γ ) has a wide, flat bowl, indicating a wide range of γ yielding essentially the same near-optimal error. 2. As s becomes larger in comparison to the fixed sample size m, the relative superiority of γopt over other values for γ becomes more pronounced. In particular, large values for γ become progressively worse as s increases. For example, the plots indicate that for s = 10 (again, m = 10,000), even though γopt = 0.524 · · · the choice γ = 0.75 will result in error quite near that achieved using γopt . However, for s = 500, γ = 0.75 is predicted to yield greatly suboptimal error. Note that for very large s, the bound predicts vacuously large error for all values of γ , so that the choice of γ again becomes irrelevant. 3. Because of the insensitivity to γ for when s is small compared to m, there is a fixed value of γ that seems to yield reasonably good performance for a wide range of values for s. This value is essentially the value of γopt for the case where s is large but nontrivial generalization is still possible, since choosing the best value for γ is more important there than for the small s case. 4. The value of γopt is decreasing as s increases. This is slightly difficult to confirm from the plot, but can be seen clearly from the precise expression for γopt . Despite the fact that the analysis so far has been rather specialized (addressing the behavior for a fixed structure, target function, and input distribution), it is our belief that the four phenomena just listed hold for many other model selection problems. For instance, we shall shortly demonstrate that the theory again predicts that these phenomena hold for the Perceptron problem and for the case of power law decay of ²g (d) described earlier. First we give an experimental demonstration that at least the predicted properties 1, 2, and 3 truly do hold for the intervals problem. In Figure 3, we plot the results of experiments in which labeled random samples of size m = 5000 were generated for a target function of s equal width intervals, for s =10,100 and 500. The samples were corrupted by random label noise at rate η = 0.3. For each value of γ and each value of d, (1 − γ )m of the sample was given to a program performing training error minimization within
Bound on the Error of Cross Validation
1155
err vs train set size, s=10,100,500, 30% noise, m=5000 0.5
0.4
0.3
0.2
0.1
00
1000
2000
3000
4000
5000
trainsize
Figure 3: Experimental plots of cross-validation generalization error in the intervals problem as a function of training set size (1 − γ )m. Experiments with the three target complexity values s =10,100, and 500 (bottom plot to top plot) are shown. Each point represents performance averaged over 10 trials.
Hd .10 The remaining γ m examples were used to select the best hd according to cross validation. The plots show the true generalization error of the hd selected by cross validation as a function of γ (the generalization error can be computed exactly for this problem). Each point in the plots represents an average over 10 trials. While there are obvious and significant quantitative differences between these experimental plots and the theoretical predictions of Figure 2 (indeed, even apparently systematic differences), the properties 1, 2, and 3 are rather clearly borne out by the data: 1. In Figure 3, when s is small compared to m, there is a wide range of acceptable γ ; it appears that any choice of γ between 0.10 and 0.50 yields nearly optimal generalization error. 2. By the time s = 100, the sensitivity to γ is considerably more pronounced. For example, the choice γ = 0.50 now results in clearly 10 A nice feature of the intervals problem is that training error minimization can be performed in almost linear time using a dynamic programming approach (Kearns, Mansour, Ng, & Ron, 1995).
1156
Michael Kearns
suboptimal performance, and it is more important to have γ close to 0.10. 3. Despite these complexities, there does indeed appear to be a single value of γ —approximately 0.10—that performs nearly optimally for the entire range of s examined. The fourth property—that the optimal γ decreases as the target function complexity is increased relative to a fixed m—is certainly not refuted by the experimental results, but any such effect is simply too small to be verified. It would be interesting to verify this prediction experimentally, perhaps on a different problem where the predicted effect is more pronounced.
7 Power Law Decay and the Perceptron Problem For the cases where the approximation rate ²g (d) obeys either power law decay or is that derived for the Perceptron problem discussed in section 3, the behavior of ²cv (m) as a function of γ predicted by our theory is largely the same. For example, if ²g (d) = (c/d), and we use the standard estimation p rate ρ(d, m) = d/(1 − γ )m, then an analysis similar to that performed for the intervals problem reveals that for fixed γ , the minimizing choice of d is d = (4c2 (1 − γ )m)1/3 ; plugging this value of d back into the bound on ²cv (m) and plotting for various values of c yields Figure 4. Similar to Figure 2 for the intervals problem, Figure 4 shows the predicted behavior of ²cv (m) as a function of γ for a fixed m and several different choices for the complexity parameter c. We again see that properties 1 through 4 hold strongly despite the change in ²g (d) from the intervals problem, although quantitative aspects of the prediction, which already must be taken lightly for reasons previously stated, have obviously changed, such as the interesting values of the ratio of sample size to target function complexity (that is, the values of this ratio for which the generalization error is most sensitive to the choice of γ ). Through a combination of formal analysis and plotting, it is possible to demonstrate that the properties 1 through 4 are robust to wide variations in the parameters α and ²min in the parametric form ²g (d) = (c/d)α + ²min , as well as wide variations in the form of the estimation rate ρ(d, m). For example, if ²g (d) = (c/d)2 (faster than the approximation rate examined above) and ρ = d/m (faster than the estimation rated examined above), then for the interesting ratios of m to c (that is, where the generalization error predicted is bounded away from 0 and the trivial value of 1/2), a figure quite similar to Figure 4 is obtained. Similar predictions can be derived for the Perceptron problem using the universal estimation rate bound or any similar power law form.
Bound on the Error of Cross Validation
1157
cv bound, (c/d) for c from 1.0 to 150.0, m=25000 0.5
0.4
0.3
y
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 4: Plot of the predicted generalization error of cross validation for the power law case ²g (d) = (c/d), as a function of the fraction 1 − γ of data used for training. The fixed sample size m =25,000 was used, and the six plots show the error predicted by the theory for target function complexity values c = 1 (bottom plot), 25, 50, 75, 100, and 150 (top plot).
8 Backpropagation Experiments As a final test of the predictions made by the theory presented here, we describe some experiments using the backpropagation learning algorithm for neural networks (see e.g., Rumelhart, Hinton, & Williams, 1986). In the first set of experiments (see Figure 5), backpropagation was used to train a neural network with two input units, eight hidden units, and a single output unit. The data were generated according to a nearest-neighbor function over the unit square in the real plane. More precisely, the target function was defined by s randomly chosen points (exemplars) in the unit square, along with randomly chosen labels for these exemplars. Any input in the unit square is then assigned the label of the nearest exemplar (with respect to Euclidean distance). Thus s is a measure of the complexity of the target function analogous to those for the intervals and Perceptron problems. For each such target function, 100 examples were generated and corrupted by random classification noise at rate η = 0.15. These examples were then divided into a training set and a testing set, which in this case was used to determine the apparent best number of backpropagation training epochs
1158
Michael Kearns err vs train fraction, backprop with increasing s 0.5
0.4
0.3
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 5: Experimental plots of cross-validation generalization error as a function of training set fraction (1 − γ ) for backpropagation, where m = 100 and the test set is used to determine the number of training epochs. Experiments with many different nearest-neighbor target functions are shown, with the number of exemplars s in the target function increasing from the bottom plot (s = 1, plot lies along horizontal axis) to the top plot (s = 128). Each point represents performance averaged over 300 trials.
to perform. For the purposes of the experiment, the generalization error of the chosen network was then estimated using an additional set of 8000 examples. Note that we are deviating here from the formal framework in which we began. Rather than explicitly searching for a good value of complexity in a nested sequence of increasingly complex hypothesis classes, here we are training within a single, fixed architecture. Intuitively, we are assuming that increased training time within this fixed architecture leads to increased effective complexity, which must be regulated by cross validation. This intuition, while still in need of rigorous theoretical and experimental examination, appears to have the status of a folk result within the neural network research community. Despite the deviation from our initial framework, the results shown in Figure 5 seem to confirm properties 1 through 3: sensitivity to the trainingtest split is relatively low for small s but increases as s becomes larger (until, as usual, s is sufficiently large to preclude nontrivial generalization for any split). Nevertheless, a fixed value for the split—in this case, somewhere
Bound on the Error of Cross Validation
1159
err vs train fraction, backprop with increasing noise 0.5
0.4
0.3
0.2
0.1
00
0.2
0.4
0.6
0.8
1
trainfrac
Figure 6: Experimental plots of cross-validation generalization error as a function of training set fraction (1 − γ ) for backpropagation, where m = 100 and the test set is used to determine the number of training epochs. Experiments with a fixed nearest-neighbor target function with s = 2 exemplars are shown, but with the noise rate increasing from the bottom plot (η = 0.0) to the top plot (η = 0.45). Each point represents performance averaged over 300 trials.
between 20 and 30 percent devoted for testing—seems to give good performance for all s. Similar results held in the experiments depicted by Figure 6. Here the nearest-neighbor target function was fixed to have s = 2 exemplars (with different labels), but we varied the classification noise rate η.11 Again the cross-validation test set was used to determine the number of backpropagation training epochs, and in Figure 6 we plot the resulting generalization error as a function of the training-test split for many different values of the η. Thus, in this case we are increasing the complexity of the data by adding noise. Despite the philosophical differences between adding deterministic complexity and adding noise, we see that the outcome is similar and again provides coarse confirmation of the theory.
11
Note that for the s = 2 case, the target function is actually linearly separable.
1160
Michael Kearns
9 Conclusions In summary, our theory predicts that although significant quantitative differences in the behavior of cross validation may arise for different model selection problems, the properties 1 through 4 should be present in a wide range of problems. At the very least, the behavior of our bounds exhibits these properties for a wide range of problems. It would be interesting to try to identify natural problems for which one or more of these properties is strongly violated; a potential source for such problems may be those for which the underlying learning curve deviates from classical power law behavior (Seung et al., 1992; Haussler et al., 1994). Perhaps the greatest weakness of the results presented here is in our failure to treat multifold cross validation. The difficulty lies in our apparent need for independence between the training and test sets to apply our methods. We leave the extension to multifold approaches for future research.
Acknowledgments I give warm thanks to Yishay Mansour, Andrew Ng, and Dana Ron for many enlightening conversations on cross validation and model selection. Additional thanks to Andrew Ng for his help in conducting the experiments.
References Barron, A. (1990). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory 19:930–944. Barron, A. R., & Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory 37:1034–1054. Devroye, L. (1988). Automatic pattern recognition: A study of the probability of error. IEEE Trans. Pattern Anal. Mach. Intell. 10(4):530–543. Haussler, D., Kearns, M., Seung, H. S., & Tishby, N. (1994). Rigourous learning curve bounds from statistical mechanics. In Proceedings of the Seventh Annual ACM Confernce on Computational Learning Theory (pp. 76–87). Kearns, M., Mansour, Y., Ng, A., & Ron, D. (1995). An experimental and theoretical comparison of model selection methods. In Proceedings of the Eighth Annual ACM Conference on Computational Learning Theory. Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature 323:533–536. Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A 45:6056–6091. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B 36:111–147.
Bound on the Error of Cross Validation
1161
Stone, M. (1977). Asymptotics for and against cross-validation. Biometrika 64(1):29–35. Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag. Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16(2):264–280.
Received January 4, 1996; accepted October 1, 1996.
Communicated by Leo Breiman
Averaging Regularized Estimators Michiaki Taniguchi Volker Tresp Siemens AG, Central Research, 81730 Munich, Germany
We compare the performance of averaged regularized estimators. We show that the improvement in performance that can be achieved by averaging depends critically on the degree of regularization which is used in training the individual estimators. We compare four different averaging approaches: simple averaging, bagging, variance-based weighting, and variance-based bagging. In any of the averaging methods, the greatest degree of improvement—if compared to the individual estimators—is achieved if no or only a small degree of regularization is used. Here, variance-based weighting and variance-based bagging are superior to simple averaging or bagging. Our experiments indicate that better performance for both individual estimators and for averaging is achieved in combination with regularization. With increasing degrees of regularization, the two bagging-based approaches (bagging and variance-based bagging) outperform the individual estimators, simple averaging, and variance-based weighting. Bagging and variance-based bagging seem to be the overall best combining methods over a wide range of degrees of regularization. 1 Introduction Several authors have noted the advantages of averaging estimators that were trained on either identical training data (Perrone, 1993) or bootstrap samples of the training data (a procedure termed bagging predictors by Breiman, 1994). Both theory and experiments show that averaging helps most if the errors in the individual estimators are not positively correlated and if the estimators have only small bias. On the other hand, it is well known from theory and experiment that the best performance of a single predictor is typically achieved if some form of regularization (weight decay), early stopping, or pruning is used. All three methods tend to decrease variance and increase bias of the estimator. Therefore, we expect that the optimal degrees of regularization for a single estimator and for averaging would not necessarily be the same. In this article, we investigate the effect of regularization on averaging. In addition to simple averaging and bagging, we also perform experiments using combining principles where the weighting functions are dependent on the input. The weighting functions Neural Computation 9, 1163–1178 (1997)
c 1997 Massachusetts Institute of Technology °
1164
Michiaki Taniguchi and Volker Tresp
can be derived by estimating the variance of each estimator for a given input (variance-based weighting, Tresp & Taniguchi, 1995; variance-based bagging, Taniguchi & Tresp, 1995). In the next section we derive some fundamental equations for averaging biased and unbiased estimators. In section 3 we show how the theory can be applied to regression problems, and we introduce the different averaging methods used in the experiments described in section 4. In section 5 we discuss the results, and in section 6 we present conclusions. 2 A Theory of Combining Biased Estimators 2.1 Optimal Weights for Combining Biased Estimators. We would like to estimate the unknown variable t based on the realizations of a set of M . The expected squared error between fi and t, random variables { fi }i=1 E( fi − t)2
=
E( fi − mi + mi − t)2
=
E( fi − mi )2 + E(mi − t)2 + 2E(( fi − mi )(mi − t))
=
vari + b2i ,
decomposes into the variance vari = E( fi − mi )2 , and the square of the bias bi = mi − t with mi = E( fi ). E(.) stands for the expected value. Note that E[( fi − mi )(mi − t)] = (mi − t)E( fi − mi ) = 0. In the following we are interested in estimating t by forming a linear combination of the fi , tˆ =
M X
gi fi = g0 f,
i=1
where f = ( f1 , . . . , fM )0 and the weighting vector g = (g1 , . . . , gM )0 . The expected error of the combined system is (Meir, 1995) E(tˆ − t)2
=
E(g0 f − E(g0 f ))2 + E(E(g0 f ) − t)2
=
E(g0 ( f − E( f )))2 + E(g0 m − t)2
=
g0 Äg
+
(g0 m
−
(2.1)
t)2 ,
where Ä is an M × M covariance matrix with Äij = E[( fi − mi )( fj − mj )] and with m = (m1 , . . . , mM )0 . The expected error of the combined system is minimized for1 g∗ = (mm0 + Ä)−1 tm. 1 Interestingly, even if the estimators are unbiased, that is, m = t ∀i = 1, . . . , M, the i minimum error estimator is biased, which confirms that a biased estimator can have a
Averaging Regularized Estimators
1165
2.2 Constraints. A commonly used constraint, which we also use in our experiments, is that M X
gi = 1,
gi ≥ 0, i = 1, . . . , M.
i=1
In the following, g can be written as g = (u0 h)−1 h,
(2.2)
where u = (1, . . . , 1)0 is an M-dimensional vector of ones, h = (h1 , . . . , hM )0 , and hi > 0, ∀i = 1, . . . , M. The constraint can be enforced in minimizing equation 2.1 by using the Lagrangian function, L = g0 Äg + (g0 m − t)2 + µ(g0 u − 1), with Lagrange multiplier µ. The optimum is achieved if we set (Tresp & Taniguchi, 1995) h∗ = [Ä + (m − tu)(m − tu)0 ]−1 u. Now the individual biases (mi − t) appear explicitly in the weights. For the optimal weights, E(tˆ − t)2 =
1 . u0 (Ä + (m − tu)(m − tu)0 )−1 u
Note that by using the constraint in equation 2.2, the combined estimator is unbiased if the individual estimators are unbiased, which is the main reason for employing the constraint. With unbiased estimators we obtain h∗ = Ä−1 u, and for the optimal weights, E(tˆ − t)2 = (u0 Ä−1 u)−1 . If in addition the individual estimators are uncorrelated, we obtain h∗i =
1 . vari
smaller expected error than an unbiased estimator. For example, consider the case that M = 1 and m1 = t (no bias). Then g∗ = t2 /(t2 + var1 ). This term is smaller than one if t 6= 0 and var1 > 0. Then, E(tˆ) = E(g∗ f1 ) < m1 ; that is, the minimum expected error estimator is biased.
1166
Michiaki Taniguchi and Volker Tresp
3 Averaging Regularized Estimators 3.1 Training. The previous theory can be applied to the problem of function estimation. Let us assume we have a training data set L = {(xk , yk )}Kk=1 , xk ∈
K X
(yk − fi (xk ))2 + λ
k=1
J X
w2ij ,
i = 1, . . . , M,
(3.1)
j=1
J
where {wij }j=1 are the weights in the ith neural estimator and J is the number of weights in each network. The first term is the squared error between the prediction of the neural network and the target in the training data L, and the second term is a weight-decay penalty weighted by the regularization parameter λ ≥ 0. Weight decay is commonly used to improve network performance by decreasing variance in prediction for the cost of introducing bias. It is obvious that averaging is useful only if the individual estimators differ in their prediction. Neural networks trained on an identical data set vary only because the optimization was initialized with different random initial weights and the optimization procedure terminates in different local minima. To decorrelate the individual estimators further, Breiman (1994) suggested training the estimators using bootstrap replicates LB1 , . . . , LBM , a procedure Breiman calls bagging predictors. The bootstrap replicate LBi is generated by randomly sampling K-times from the original training data set L with replacement. For background on bootstrap techniques, see Efron and Tibshirani (1993). In the experiment we considered combined estimators, which can be written as tˆ(x) =
M X i=1
gi (x) fi (x) =
M 1 X hi (x) fi (x), n(x) i=1
(3.2)
PM where n(x) = i=1 hi (x) is the normalizing factor and hi (x) ≥ 0, ∀i = 1, . . . , M. Note that we allow for the possibility that the weighting func2 In the experiments, we used standard multilayer Perceptrons with one hidden layer. For details, see section 4.
Averaging Regularized Estimators
1167
tions gi (x) depend on the input x. Also, note that it follows that (compare section 2.2) M X
gi (x) = 1,
gi (x) ≥ 0, i = 1, . . . , M.
i=1
In the experiments, we compare the performances of the combined systems for different choices of gi (x). Since in equation 3.2 averaging is performed for a fixed input x, we can apply the theory of section 2 to calculate the optimal weighting functions. Unfortunately, it is not straightforward to obtain reliable estimates of the covariance matrices and the biases. We therefore have to rely on experiments to decide if averaging is useful and in which cases which averaging method should be employed. In the experiments, we use four different combining methods, motivated by the theoretical results in section 2. The four combining methods are described in the following sections. 3.2 Simple Averaging (AV). In simple averaging, we set hi (x) = 1,
i = 1, . . . , M for all x.
It is easy to come up with examples where simple averaging is optimal— for example, if all estimators are unbiased and uncorrelated with identical variances or in general when symmetry indicates that no single estimator should be preferred. Past experiments have shown that simple averaging can improve performance considerably. For experimental results and theoretical background of simple averaging, see Perrone (1993), Jacobs (1995), Krogh and Vedelsby (1995), and Wolpert (1992). 3.3 Bagging (BA). The only difference to AV is that the individual networks are trained on bootstrap replicates. Set again hi (x) = 1,
i = 1, . . . , M for all x.
Although the individual estimators are less correlated, we expect that each individual estimator contains more variance (and possibly more bias) because the training set contains fewer distinct data if compared to the case where each estimator is trained on the complete data set. 3.4 Variance-Based Weighting (VW). In variance-based weighting we set hi (x) =
1 , var( fi (x))
i = 1, . . . , M.
The intuitive idea of the variance-based approach is that when an estimator is uncertain about its own prediction for a certain input, then this estimator is not competent for this input and obtains a low weight. From our
1168
Michiaki Taniguchi and Volker Tresp
theoretical considerations, it is clear that this is optimal if the errors in the individual estimators are uncorrelated and the estimators are unbiased. The errors of estimators trained on identical data or on bootstrap replicates are very likely correlated and weight decay introduces bias such that— from a theoretical point of view—variance-based weighting is not optimal. Again, experimental results have to show under which conditions variance-based weighting is useful. Also, for calculation of the variance, it is important to be clear about which random process we average. If the expected value is taken over all random initial weights, the variances of all estimators are identical and we would obtain simple averaging. On the other hand, if we consider that each estimator has found a different local minimum, we can consider each estimator as a distinct model. Now we only average over the noise on the targets. We consider that the targets were generated by the random process yk = t(xk ) + γ , where t(xk ) = E(yk |xk ) and γ is independent zero-mean noise with variance σ 2. The variance of an individual predictor for input x can be estimated by a number of different methods (Tibshirani, 1994). We use var( fi (x)) ≈ σ 2 θi (x)T Hi−1 θi (x), where θi (x) = ∂ fi (x)/∂wi is the output sensitivity of the neural estimator fi (x) with respect to the weights wi at the input x, and where wi = (wi1 , . . . , wiJ )0 are the weights in the ith estimator. Hi is the Hessian, which can be approximated (for the unregularized estimator) as Hi ≈
T K X ∂ fi (xk ) ∂ fi (xk ) k=1
∂wi
∂wi
.
(3.3)
xk is the kth sample in the training data set L. 3.5 Variance-Based Bagging (VB). The only difference to VW is that M of the the neural networks are trained on the bootstrap replicates {LBi }i=1 original training set L. The Hessian matrix in equation 3.3 is now calculated using the bootstrap samples. 4 Experiments In this section, we present experimental results using two real-world data sets: the Breast Cancer data and the DAX data.3 The first data set can be obtained from the University of California, Irvine repository (ftp:ics.uci.edu/ 3 The DAX (Deutschen Aktien Index) represents the average value of the stock of a set of representative companies, analogous to the Dow Jones Index in the United States.
Averaging Regularized Estimators
1169
pub/machine-learning-databases). The second data set can be obtained by contacting the authors. We compare the four different combining methods described in the previous sections using these two databases with varying degrees of regularization. In the experiments M = 25 neural networks were combined. All neural networks had the same fixed architecture: a Multilayer Perceptron with a single hidden layer of 10 hidden units. The initial weights of the estimators were chosen randomly out of a uniform distribution between −0.2 and +0.2. We divided the databases randomly in two independent sets: the training set L and the test set T. For AV and VW, each neural estimator was trained on the whole training set. For the bagging-based approaches BA and VB, each estimator was trained on the bootstrap replicates LBi of the training set L. For training, a quasi-Newton-method with a fixed number of iterations (400 for the Breast Cancer data and 300 for the DAX data) was used. To obtain statistically significant results, we repeated each experiment 10 times (R = 10 runs) for both data sets. For the Breast Cancer data with relatively few data, we chose a different division into test data and training data for each run. 4.1 Performance Criteria. The averaged summed squared error ASSEC (λ) is the squared test set error of the individual estimators trained on complete data and with weight decay parameter λ averaged over all M estimators and averaged over all R runs. ASSEB (λ) is the equivalent measure for networks trained on bootstrap samples. ASSEcomb (λ) with comb ∈ {AV, BA, VW, VB} is the squared test set error of the combining methods with weight decay parameter λ averaged over all R runs. Furthermore, we define ASTDC (λ) as the averaged standard deviation of the prediction of the neural networks trained with complete data (averaged over all estimators and all runs). ASTDB (λ) is the equivalent measure for networks trained on bootstrap samples. The mathematical formulas describing the different measures can be found in appendix A. 4.2 Breast Cancer Data. The database has been recorded at the University of Wisconsin Hospitals, Madison. The data set contains 699 samples with 9 input variables consisting of cellular characteristics and one binary output with 458 benign and 241 malignant cases. All input variables were normalized to zero mean and a standard deviation of one. For every run we divided the database randomly in K = 599 training samples and P = 100 test samples. Figure 1 (top) shows the averaged performance of the individual estimators as a function of the regularization parameter λ. The large values of both ASSEC (λ) and ASSEB (λ) for λ = 0 indicate that neural networks without regularization extremely overfit the data. For λ ≈ 1.25 the networks trained
1170
Michiaki Taniguchi and Volker Tresp
8
7
6
ASSE
5
4
3
2
1 0
0.5
1
1.5 lambda
2
2.5
3
0.14
0.12
ASTD
0.1
0.08
0.06
0.04
0.02
0 0
0.5
1
1.5 lambda
2
2.5
3
Figure 1: (Top) ASSEC (solid line) and ASSEB (dash-dotted line) as a function of λ for the Breast Cancer data. For λ = 0, ASSEC = 746 and ASSEB = 133 are out of scale. (Bottom) ASTDC (solid line) and ASTDB (dash-dotted line) as a function of λ for the Breast Cancer data. For λ = 0, ASTDC = 0.49 and ASTDB = 0.55 are out of scale.
on complete data obtain best performance. ASSEB (λ) reaches a minimum at λ = 2. As expected, ASSEB (λ) is always larger than ASSEC (λ) since the networks trained on bootstrap replicates have seen fewer distinct data. The difference between ASSEB (λ) and ASSEC (λ) decreases with increasing λ. Figure 1 (bottom) shows ASTDC (λ) and ASTDB (λ). Both decrease monotonously with growing λ, and ASTDB (λ) is consistently larger. Even for large λ, ASTDB (λ) is still greater than 0.02, whereas ASTDC (λ) is close to
Averaging Regularized Estimators
1171
zero. The figure clearly demonstrates that bagging increases variance in the prediction. The test set performances of the different combination methods are plotted in Figure 2 (top). With no regularization, all averaging methods show dramatically better performance if compared to the individual estimators. The variance-based approaches VW and VB are both better than the other averaging methods if λ is close to zero. This result seems to indicate that the variance-based methods successfully recognize the large variance in networks due to local overtraining. With increasing λ, the individual networks, as well as all averaging methods, improve performance. With increasing λ, the performances of AV and VW—the approaches in which the networks were trained on complete data—converge to ASSEC (λ). In particular for 0.3 < λ < 1 the performance of bagging is impressive and is superior to AV and VW. Most striking, however, VB shows very good performance for any λ and seems to combine the advantages of both VW and BA. Note that the optimum for the individual estimators is at a larger degree of regularization than the optimum of the bagging approaches BA and VB. For AV and VW, best performance coincides with the best performance of the averaged individual networks trained on complete data. Figure 2 (bottom) shows the number of single estimators that were worse than the averaging methods in percentages. The impressive performance of VB is also apparent: In the large majority of settings for λ, VB is better than all individual networks. Bagging shows excellent performance up to λ ≈ 1, and AV and VW are best for small values of λ <≈ 0.5. 4.3 DAX Data. The DAX data consist of 2564 samples from March 5, 1984, to December 30, 1993. The goal is to predict the value of the DAX of the following day based on past measurements of the DAX. As input variables, 12 indicators were calculated (see appendix B). All inputs and the output were normalized with zero mean and variance of one. The training data set consisted of K = 2150 samples (from March 5, 1984 to May 29, 1992), and the test set consisted of P = 414 test samples (from June 1, 1992 to December 30, 1993). Figure 3 (top) shows ASSEC (λ) and ASSEB (λ) as a function of the regularization parameter λ. The graph shows qualitatively the same characteristics as in the previous experiment. Again, for small λ, we see overfitting. ASSEC (λ) is optimal at λ ≈ 30, and ASSEB (λ) is optimal at λ ≈ 50. Again, ASSEB (λ) is always larger than ASSEC (λ) since the networks trained on bootstrap replicates have seen fewer distinct data. Figure 3 (bottom) shows the average standard deviation of the prediction (ASTDC (λ), ASTDB (λ)). Again, for increasing λ, ASTDC (λ) quickly approaches zero, whereas ASTDB (λ) still assumes relative large values. The test set performances of the different combination methods is plotted in Figure 4 (top and center). With no regularization, all averaging methods
1172
Michiaki Taniguchi and Volker Tresp
3
2.9
AV BA
2.8
VW
ASSE^B
VB 2.7
ASSE
ASSE^C 2.6
2.5
2.4
2.3
2.2 0
0.5
1
1.5 lambda
2
2.5
3
0.5
1
1.5 lambda
2
2.5
3
100
90
Percentage
80
70
60
50
40 0
Figure 2: (Top) ASSEcomb (λ) of the different averaging approaches for the Breast Cancer data. Displayed is AV (dashed line), BA (dotted line), VW (dash-dotted line), and VB (lower solid line). The highest continuous line shows the average performance of the individual bagging networks, and the second highest line shows the average performance of the networks trained on the complete data. (Bottom) The number of single estimators trained on complete data, which are worse than the averaging methods for the Breast Cancer data (in percentages).
show much better performance if compared to the individual estimators. The variance-based approaches VW and VB are slightly better than the other averaging methods at λ = 0. Interestingly, with increasing λ, the performances of the averaging methods first decrease but have a global optimum for intermediate values of λ. This can be explained by the drastic improvement of the individual estimators with optimal λ. All averaging methods show excellent performance for 30 < λ < 40. Best performance
Averaging Regularized Estimators
1173
280
270
260
ASSE
250
240
230
220
210 0
10
20
30
40
50 lambda
60
70
80
90
100
30
40
50 lambda
60
70
80
90
100
0.4
0.35
0.3
ASTD
0.25
0.2
0.15
0.1
0.05
0 0
10
20
Figure 3: (Top) ASSEC (solid line) and ASSEB (dash-dotted line) as a function of λ for the DAX data. (Bottom) ASTDC (solid line) and ASTDB (dash-dotted line) as a function of λ for the DAX data.
is achieved for VB at λ = 40. Note that for λ > 50, AV and VW are not noticably better than the average of the individual networks, whereas BA and VB are still considerably better than the mean of the networks trained on bootstrap samples. Figure 4 (bottom) shows the number of single estimators that were worse than the averaging methods in percentage. Apparent is the impressive performance of all averaging methods if no regularization is used. For λ < 50, all averaging methods are better than 65 percent of all individual estimators. For λ > 50, BA and VB are considerably better than the individual estimators, whereas the improvement achieved with AV and VW decreases quickly.
1174
Michiaki Taniguchi and Volker Tresp
219
218
ASSE
217
216
215
214
213 0
0.5
1
1.5
2
2.5
lambda
219
AV
218
BA VW
217
VB
ASSE
216
215
ASSE^B
214 ASSE^C 213
212
211 0
10
20
30
40
50 lambda
60
70
80
90
100
60
70
80
90
100
95 90 85
Percentage
80 75 70 65 60 55 50 45 0
10
20
30
40
50 lambda
Figure 4: (Top and center) ASSE of the different averaging approaches for the DAX data for different scales of λ. Displayed are AV (dashed line), BA (dotted line), VW (dash-dotted line), and VB (lower solid line). The highest continuous line shows the average performance of the individual bagging networks, and the second highest line shows the average performance of the networks trained on the complete data. (Bottom) The figure shows the number of single estimators trained on complete data, which are worse than the averaging methods for the DAX data (in percentage).
Averaging Regularized Estimators
1175
5 Discussion The experiments confirmed the theory but also showed some unexpected results. As predicted, the variance in the networks decreases rapidly with increasing weight-decay parameter λ if networks are trained on complete data, as shown in Figures 1 and 3. On the other hand, if networks are trained on bootstrap replicates, we obtain a large variance in the networks even at relatively large values of λ. The performance of the individual networks is worse for networks trained on bootstrap replicates since each estimator has seen a smaller number of distinct data. The relative improvement in performance by averaging increases with increasing variance in the estimates and bias hurts. Therefore for all averaging methods, the relative improvement is maximum if no weight decay is used. On the other hand, the performances of the networks improve with weight decay. So, as confirmed by experiment, regularization also improves the averaged systems. Simple averaging (AV) shows good performance at small values of λ. Even better at small λ is variance-based weighting (VW). The reason is that local overtraining is reflected in large variance. The corresponding estimator consequently obtains a small weight. With increasing λ, the performance of both AV and VW becomes comparable, and both approach the performance of an average individual estimator trained on complete data for large λ. This confirms that all estimators are highly correlated for large λ, as shown in Figures 1 and 3. Bagging (BA) displays better performance than AV and VW up to intermediate values of λ except when λ is extremely small or zero. This can be explained by the fact that training on bootstrap samples results in considerable variance in the networks even for large λ. Variance-based bagging (VB) seems to combine the advantages of both variance-based weighting and bagging. If networks are overtrained, they locally have large variance and obtain a small weight locally. Training on bootstrap replicates introduces additional variance in the networks, which is particularly useful for large λ. In our first experiment (Breast Cancer data), variance-based bagging was the overall best combining method over a wide range of degrees of regularization. In the second experiment (DAX data) BA and VB show similar performance for intermediate and large values of λ, but VB shows superior performance for small λ. 6 Conclusions Based on our experiments we can conclude that in comparison with the individual estimators, averaging improves performance at all levels of regularization. In particular we also obtain improvements with respect to optimally regularized estimators, although the degree of improvement is application specific. Averaging is less sensitive with respect to the regularization parameter λ if compared to the individual estimators. Especially if the individual estimators overfit, averaging still gives excellent performance. Overall, bag-
bagging and variance-based bagging, which both use networks trained on bootstrap replicates, work well for a wide range of values of λ. At extremely small values of λ, variance-based weighting and variance-based bagging are clearly superior to the other averaging approaches.

Appendix A: Performance Criteria

Let $f^{C,\lambda}_{i,j}$ be the $i$th estimator in the $j$th run, trained on the complete training data with regularization parameter $\lambda$ and evaluated on the test set. The summed squared error for $f^{C,\lambda}_{i,j}$ is defined as
$$ \mathrm{SSE}^{C}_{i,j}(\lambda) = \sum_{p=1}^{P} \left( y^{p}_{j} - f^{C,\lambda}_{i,j}(x^{p}_{j}) \right)^{2}, $$
where $T_j = \{(x^{p}_{j}, y^{p}_{j})\}_{p=1}^{P}$ are the test data in the $j$th run. As performance criterion for the individual estimators, we use the averaged summed squared error, where we average over all $M$ estimators and all $R$ runs:
$$ \mathrm{ASSE}^{C}(\lambda) = \frac{1}{R} \sum_{j=1}^{R} \frac{1}{M} \sum_{i=1}^{M} \mathrm{SSE}^{C}_{i,j}(\lambda). $$
The performance measures for networks trained on bootstrap replicates, $\mathrm{SSE}^{B}_{i,j}(\lambda)$ and $\mathrm{ASSE}^{B}(\lambda)$, are defined analogously. To measure the performance of the combining methods we define, similarly,
$$ \mathrm{SSE}^{\mathrm{comb}}_{j}(\lambda) = \sum_{p=1}^{P} \left( y^{p}_{j} - \hat{t}^{\lambda}_{j}(x^{p}_{j}) \right)^{2}, $$
where $\hat{t}^{\lambda}_{j}$ is the response of the combined system from equation 3.2 in the $j$th run for regularization parameter $\lambda$. The averaged SSE for the combining systems is defined as
$$ \mathrm{ASSE}^{\mathrm{comb}}(\lambda) = \frac{1}{R} \sum_{j=1}^{R} \mathrm{SSE}^{\mathrm{comb}}_{j}(\lambda), $$
where $\mathrm{comb} \in \{\mathrm{AV}, \mathrm{BA}, \mathrm{VW}, \mathrm{VB}\}$. The averaged standard deviation of the predictions of the neural networks trained on complete data is defined as
$$ \mathrm{ASTD}^{C}(\lambda) = \frac{1}{R} \sum_{j=1}^{R} \sqrt{ \frac{1}{P} \sum_{p=1}^{P} \frac{1}{M} \sum_{i=1}^{M} \left( f^{C,\lambda}_{i,j}(x^{p}_{j}) - \bar{f}^{C,\lambda}_{j}(x^{p}_{j}) \right)^{2} } $$
with
$$ \bar{f}^{C,\lambda}_{j}(x^{p}_{j}) = \frac{1}{M} \sum_{i=1}^{M} f^{C,\lambda}_{i,j}(x^{p}_{j}). $$
Table A.1: Input Variables for Predicting the DAX at Day t + 1.

1. $y(t)/y(t-1)$
2. $\log(y(t)) - \frac{1}{5} \sum_{i=0}^{4} \log(y(t-i))$
3. $\log(y(t)) - \frac{1}{10} \sum_{i=0}^{9} \log(y(t-i))$
4. $\bigl(\log(y(t)) - \log(y(t-5))\bigr) - \bigl(\log(y(t)) - \log(y(t-6))\bigr)$
5. $\bigl(\log(y(t)) - \log(y(t-10))\bigr) - \bigl(\log(y(t)) - \log(y(t-11))\bigr)$
6. $\mathrm{rsi}(\log(y(t)), 5)$
7. $\mathrm{rsi}(\log(y(t)), 10)$
8. $\dfrac{\log(y(t)) - \min_{i=1,\dots,4} \log(y(t-i))}{\max_{i=1,\dots,4} \log(y(t-i)) - \min_{i=1,\dots,4} \log(y(t-i))}$
9. $\dfrac{\log(y(t)) - \min_{i=1,\dots,9} \log(y(t-i))}{\max_{i=1,\dots,9} \log(y(t-i)) - \min_{i=1,\dots,9} \log(y(t-i))}$
10. $\dfrac{\frac{1}{3} \sum_{k=0}^{2} \log(y(t-k)) - \min_{i=1,\dots,4} \log(y(t-i))}{\max_{i=1,\dots,4} \log(y(t-i)) - \min_{i=1,\dots,4} \log(y(t-i))}$
11. $\dfrac{\frac{1}{3} \sum_{k=0}^{2} \log(y(t-k)) - \min_{i=1,\dots,9} \log(y(t-i))}{\max_{i=1,\dots,9} \log(y(t-i)) - \min_{i=1,\dots,9} \log(y(t-i))}$
12. $\mathrm{vol}(t)/\mathrm{vol}(t-1)$
$\mathrm{ASTD}^{C}$ is a measure of the degree of variance among the individual networks. The corresponding measure for networks trained on bootstrap replicates, $\mathrm{ASTD}^{B}(\lambda)$, is defined analogously.

Appendix B: The DAX Inputs

Table A.1 describes the input variables used in the DAX data. $y(t)$ is the DAX at day $t$, and $\mathrm{vol}(t)$ is the total volume of the transactions at the German stock market at day $t$. Furthermore,
$$ \mathrm{rsi}(y(t), n) = \frac{ \sum_{i=0}^{n-1} b\bigl( y(t-i) - y(t-i-1) \bigr) }{ \sum_{i=0}^{n-1} \bigl| y(t-i) - y(t-i-1) \bigr| } $$
with
$$ b(z) = \begin{cases} 0 & \text{if } z \le 0, \\ z & \text{if } z > 0. \end{cases} $$
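As a quick numerical illustration (our own toy numbers, not from the experiments): take $n = 2$ with $y(t-2) = 101$, $y(t-1) = 100$, and $y(t) = 102$, so the two successive differences are $+2$ and $-1$. Then
$$ \mathrm{rsi}(y(t), 2) = \frac{b(2) + b(-1)}{|2| + |{-1}|} = \frac{2 + 0}{3} = \frac{2}{3}, $$
that is, two-thirds of the recent absolute movement of the series was upward.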
For a motivation of the preprocessing and further details, see Dichtl (1995).

Acknowledgments

This research was partially supported by the Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie, grant number 01 IN 505 A.
References

Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 421). Berkeley: Department of Statistics, University of California at Berkeley.
Dichtl, H. (1995). Zur Prognose des Deutschen Aktienindex DAX mit Hilfe von Neuro-Fuzzy-Systemen. Beiträge zur Theorie der Finanzmärkte, 12. Frankfurt: Institut für Kapitalmarktforschung, J. W. Goethe-Universität.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. London: Chapman and Hall.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments. Neural Computation, 7, 867–888.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press.
Meir, R. (1995). Bias, variance and the combination of least squares estimators. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press.
Perrone, M. P. (1993). Improving regression estimates: Averaging methods for variance reduction with extensions to general convex measure optimization. Doctoral dissertation, Brown University, Providence, RI.
Taniguchi, M., & Tresp, V. (1995). Variance-based combination of estimators trained by bootstrap replicates. In Proc. International Symposium on Artificial Neural Networks. Hsinchu, Taiwan.
Tibshirani, R. (1994). A comparison of some error estimates for neural network models (Tech. Rep.). Toronto: Department of Statistics, University of Toronto.
Tresp, V., & Taniguchi, M. (1995). Combining estimators using non-constant weighting functions. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.
Received March 27, 1996; accepted August 8, 1996.
REVIEW
Communicated by Dan Johnston
The NEURON Simulation Environment

M. L. Hines, Department of Computer Science and Neuroengineering and Neuroscience Center, Yale University, New Haven, CT 06520, U.S.A.
N. T. Carnevale Psychology Department, Yale University, New Haven, CT 06520, U.S.A.
The moment-to-moment processing of information by the nervous system involves the propagation and interaction of electrical and chemical signals that are distributed in space and time. Biologically realistic modeling is needed to test hypotheses about the mechanisms that govern these signals and how nervous system function emerges from the operation of these mechanisms. The NEURON simulation program provides a powerful and flexible environment for implementing such models of individual neurons and small networks of neurons. It is particularly useful when membrane potential is nonuniform and membrane currents are complex. We present the basic ideas that would help informed users make the most efficient use of NEURON.

1 Introduction

NEURON (Hines, 1984, 1989, 1993, 1994) provides a powerful and flexible environment for implementing biologically realistic models of electrical and chemical signaling in neurons and networks of neurons. This article describes the concepts and strategies that have guided the design and implementation of this simulator, with emphasis on those features that are particularly relevant to its most efficient use.
1.1 The Problem Domain. Information processing in the brain results from the spread and interaction of electrical and chemical signals within and among neurons. This involves nonlinear mechanisms that span a wide range of spatial and temporal scales (Carnevale & Rosenthal, 1992) and are constrained to operate within the intricate anatomy of neurons and their interconnections. Consequently the equations that describe brain mechanisms generally do not have analytical solutions, and intuition is not a reliable guide to understanding the working of the cells and circuits of the brain. Furthermore, these nonlinearities and spatiotemporal complexities are quite unlike those that are encountered in most nonbiological systems,
so the utility of many quantitative and qualitative modeling tools that were developed without taking these features into consideration is severely limited. NEURON is designed to address these problems by enabling both the convenient creation of biologically realistic quantitative models of brain mechanisms and the efficient simulation of the operation of these mechanisms. In this context the term biological realism does not mean "infinitely detailed." Instead, it means that the choice of which details to include in the model and which to omit is at the discretion of the investigator who constructs the model, and not forced by the simulation program.

To the experimentalist, NEURON offers a tool for cross-validating data, estimating experimentally inaccessible parameters, and deciding whether known facts account for experimental observations. To the theoretician, it is a means for testing hypotheses and determining the smallest subset of anatomical and biophysical properties that is necessary and sufficient to account for particular phenomena. To the student in a laboratory course, it provides a vehicle for illustrating and exploring the operation of brain mechanisms in a simplified form that is more robust than the typical "wet lab" experiment. For the experimentalist, theoretician, and student alike, a powerful simulation tool such as NEURON can be an indispensable aid to developing the insight and intuition that is needed if one is to discover the order hidden within the intricacy of biological phenomena, the order that transcends the complexity of accident and evolution.

1.2 Experimental Advances and Quantitative Modeling. Experimental advances drive and support quantitative modeling. Over the past two decades, the field of neuroscience has seen striking developments in experimental techniques, among them the following:

• High-quality electrical recording from neurons in vitro and in vivo using patch clamp.
• Multiple impalements of visually identified cells.
• Simultaneous intracellular recording from paired pre- and postsynaptic neurons.
• Simultaneous measurement of electrical and chemical signals.
• Multisite electrical and optical recording.
• Quantitative analysis of anatomical and biophysical properties from the same neuron.
• Photolesioning of cells.
• Photorelease of caged compounds for spatially precise chemical stimulation.
• New drugs such as channel blockers and receptor agonists and antagonists.
• Genetic engineering of ion channels and receptors.
• Analysis of messenger RNA and biophysical properties from the same neuron.
• "Knockout" mutations.

These and other advances are responsible for impressive progress in the definition of the molecular biology and biophysics of receptors and channels; the construction of libraries of identified neurons and neuronal classes that have been characterized anatomically, pharmacologically, and biophysically; and the analysis of neuronal circuits involved in perception, learning, and sensorimotor integration. The result is a data avalanche that catalyzes the formulation of new hypotheses of brain function, while at the same time serving as the empirical basis for the biologically realistic quantitative models that must be used to test these hypotheses. Following are some examples from the large list of topics that have been investigated through the use of such models:

• The cellular mechanisms that generate and regulate chemical and electrical signals (Destexhe, Contreras, Steriade, Sejnowski, & Huguenard, 1996; Jaffe et al., 1994).
• Drug effects on neuronal function (Lytton & Sejnowski, 1992).
• Presynaptic (Lindgren & Moore, 1989) and postsynaptic (Destexhe & Sejnowski, 1995; Traynelis, Silver, & Cull-Candy, 1993) mechanisms underlying communication between neurons.
• Integration of synaptic inputs (Bernander, Douglas, Martin, & Koch, 1991; Cauller & Connors, 1992).
• Action potential initiation and conduction (Häusser, Stuart, Racca, & Sakmann, 1995; Hines & Shrager, 1991; Mainen, Joerges, Huguenard, & Sejnowski, 1995).
• Cellular mechanisms of learning (Brown, Zador, Mainen, & Claiborne, 1992; Tsai, Carnevale, & Brown, 1994a).
• Cellular oscillations (Destexhe, Babloyantz, & Sejnowski, 1993a; Lytton, Destexhe, & Sejnowski, 1996).
• Thalamic networks (Destexhe, McCormick, & Sejnowski, 1993b; Destexhe, Contreras, Sejnowski, & Steriade, 1994).
• Neural information encoding (Hsu et al., 1993; Mainen & Sejnowski, 1995; Softky, 1994).
2 Overview of NEURON

NEURON is intended to be a flexible framework for handling problems in which membrane properties are spatially inhomogeneous and where membrane currents are complex. Since it was designed specifically to simulate the equations that describe nerve cells, NEURON has three important advantages over general-purpose simulation programs. First, the user is not required to translate the problem into another domain, but instead is able to deal directly with concepts that are familiar at the neuroscience level. Second, NEURON contains functions that are tailored specifically for controlling the simulation and graphing the results of real neurophysiological problems. Third, its computational engine is particularly efficient because of the use of special methods and tricks that take advantage of the structure of nerve equations (Hines, 1984; Mascagni, 1989).

The general domain of nerve simulation, however, is still too large for any single program to deal optimally with every problem. In practice, each program has its origin in a focused attempt to solve a restricted class of problems. Both speed of simulation and the ability of the user to maintain conceptual control degrade when any program is applied to problems outside the class for which it is best suited.

NEURON is computationally most efficient for problems that range from parts of single cells to small numbers of cells in which cable properties play a crucial role. In terms of conceptual control, it is best suited to tree-shaped structures in which the membrane channel parameters are approximated by piecewise linear functions of position. Two classes of problems for which it is particularly useful are those in which it is important to calculate ionic concentrations and those where one needs to compute the extracellular potential just next to the nerve membrane. It is especially capable for investigating new kinds of membrane channels since they are described using a high-level neuron model description language (NMODL) (Moore & Hines, 1996), which allows the expression of models in terms of kinetic schemes or sets of simultaneous differential and algebraic equations. To maintain efficiency, user-defined mechanisms in NMODL are automatically translated into C, compiled, and linked into the rest of NEURON.

The flexibility of NEURON comes from a built-in object-oriented interpreter, which is used to define the morphology and membrane properties of neurons, control the simulation, and establish the appearance of a graphical interface. The default graphical interface is suitable for exploratory simulations involving the setting of parameters, control of voltage and current stimuli, and graphing variables as a function of time and position.

Simulation speed is excellent since membrane voltage is computed by an implicit integration method optimized for branched structures (Hines, 1984). The performance of NEURON degrades very slowly with increased complexity of morphology and membrane mechanisms, and it has been applied to very large network models: 10^4 cells with six compartments each,
for a total of 10^6 synapses in the net (T. Sejnowski, personal communication, March 29, 1996).

3 Mathematical Basis

Strategies for numerical solution of the equations that describe chemical and electrical signaling in neurons have been discussed in many places. Elsewhere we have briefly presented an intuitive rationale for the most commonly used methods (Hines & Carnevale, 1995). Here we start from this base and proceed to address those aspects that are most pertinent to the design and application of NEURON.

3.1 The Cable Equation. The application of cable theory to the study of electrical signaling in neurons has a long history, which is briefly summarized elsewhere (Rall, 1989). The basic computational task is to solve numerically the cable equation
$$ \frac{\partial V}{\partial t} + I(V, t) = \frac{\partial^2 V}{\partial x^2}, \qquad (3.1) $$
which describes the relationship between current and voltage in a one-dimensional cable. The branched architecture typical of most neurons is incorporated by combining equations of this form with appropriate boundary conditions. Spatial discretization of this partial differential equation is equivalent to reducing the spatially distributed neuron to a set of connected compartments. The earliest example of a multicompartmental approach to the analysis of dendritic electrotonus was provided by Rall (1964). Spatial discretization produces a family of ordinary differential equations of the form
$$ c_j \frac{dv_j}{dt} + i_{\mathrm{ion}_j} = \sum_k \frac{v_k - v_j}{r_{jk}}. \qquad (3.2) $$
Equation 3.2 is a statement of Kirchhoff's current law, which asserts that net transmembrane current leaving the jth compartment must equal the sum of axial currents entering this compartment from all sources (see Figure 1). The left-hand side of equation 3.2 is the total membrane current, which is the sum of capacitive and ionic components. The capacitive component is $c_j\,dv_j/dt$, where $c_j$ is the membrane capacitance of the compartment. The ionic component $i_{\mathrm{ion}_j}$ includes all currents through ionic channel conductances. The right-hand side of equation 3.2 is the sum of axial currents that enter this compartment from its adjacent neighbors. Currents injected through a microelectrode would be added to the right-hand side. The sign conventions for current are as follows: outward transmembrane current is positive; axial current flow into a region is positive; positive injected current drives $v_j$ in a positive direction.
Figure 1: The net current entering a region must equal zero.
Equation 3.2 involves two approximations. First, axial current is specified in terms of the voltage drop between the centers of adjacent compartments. The second approximation is that spatially varying membrane current is represented by its value at the center of each compartment. This is much less drastic than the often-heard statement that a compartment is assumed to be "isopotential." It is far better to picture the approximation in terms of voltage varying linearly between the centers of adjacent compartments. Indeed, the linear variation in voltage is implicit in the usual description of a cable in terms of discrete electrical equivalent circuits. If the compartments are of equal size, it is easy to use Taylor's series to show that both of these approximations have errors proportional to the square of compartment length. Thus, replacing the second partial derivative by its central difference approximation introduces errors proportional to $\Delta x^2$, and doubling the number of compartments reduces the error by a factor of four.
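The Taylor-series step behind that claim is worth spelling out (a standard argument, stated here in our notation for convenience). Expanding $v$ about $x$,
$$ v(x \pm \Delta x) = v(x) \pm \Delta x\, v'(x) + \frac{\Delta x^2}{2}\, v''(x) \pm \frac{\Delta x^3}{6}\, v'''(x) + O(\Delta x^4), $$
the odd-order terms cancel in the sum, so
$$ \frac{v(x + \Delta x) - 2v(x) + v(x - \Delta x)}{\Delta x^2} = v''(x) + O(\Delta x^2). $$
Halving $\Delta x$ (doubling the number of compartments) therefore reduces the leading error term by a factor of four.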
It is often not convenient for the size of all compartments to be equal. Unequal compartment size might be expected to yield simulations that are only first-order accurate. However, comparison of simulations in which unequal compartments are halved or quartered in size generally reveals a second-order reduction of error. A rough rule of thumb is that simulation error is proportional to the square of the size of the largest compartment.

The first of two special cases of equation 3.2 that we wish to discuss allows us to recover the usual parabolic differential form of the cable equation. Consider the interior of an unbranched cable with constant diameter. The axial current consists of two terms involving compartments with the natural indices $j - 1$ and $j + 1$, that is,
$$ c_j \frac{dv_j}{dt} + i_{\mathrm{ion}_j} = \frac{v_{j-1} - v_j}{r_{j-1,j}} + \frac{v_{j+1} - v_j}{r_{j,j+1}}. $$
If the compartments have the same length $\Delta x$ and diameter $d$, then the capacitance of a compartment is $C_m \pi d \Delta x$, and the axial resistance is $R_a \Delta x / \pi (d/2)^2$. $C_m$ is called the specific capacitance of the membrane, which is generally taken to be 1 µF/cm². $R_a$ is the axial resistivity, which has different reported values for different cell classes (e.g., 35.4 Ω cm for squid axon). Equation 3.2 then becomes
$$ C_m \frac{dv_j}{dt} + i_j = \frac{d}{4 R_a}\, \frac{v_{j+1} - 2v_j + v_{j-1}}{\Delta x^2}, $$
where we have replaced the total ionic current $i_{\mathrm{ion}_j}$ with the current density $i_j$. The right-hand term, as $\Delta x \to 0$, is just $\partial^2 v / \partial x^2$ at the location of the now-infinitesimal compartment $j$.

The second special case of equation 3.2 allows us to recover the boundary condition. This is an important exercise since naive discretizations at the ends of the cable have destroyed the second-order accuracy of many simulations. Nerve boundary conditions are that no axial current flows at the end of the cable; the end is sealed. This is implicit in equation 3.2, where the right-hand side consists of only the single term $(v_{j-1} - v_j)/r_{j-1,j}$ when compartment $j$ lies at the end of an unbranched cable.

3.2 Spatial Discretization in a Biological Context: Sections and Segments. Every nerve simulation program solves for the longitudinal spread of voltage and current by approximating the cable equation as a series of compartments connected by resistors (see Figure 4 and equation 3.2). The sum of all the compartment areas is the total membrane area of the whole nerve. Unfortunately, it is usually not clear at the outset how many compartments should be used. Both the accuracy of the approximation and the computation time increase as the number of compartments used to represent the cable increases. When the cable is "short," a single compartment can be made to represent the entire cable adequately. For long cables or highly branched structures, it may be necessary to use a large number of compartments.

This raises the question of how best to manage all the parameters that exist within these compartments. Consider membrane capacitance, which
has a different value in each compartment. Rather than specify the capacitance of each compartment individually, it is better to deal in terms of a single specific membrane capacitance that is constant over the entire cell and have the program compute the values of the individual capacitances from the areas of the compartments. Other parameters such as diameter or channel density may vary widely over short distances, so the graininess of their representation may have little to do with numerically adequate compartmentalization. Although NEURON is a compartmental modeling program, the specification of biological properties (neuron shape and physiology) has been separated from the numerical issue of compartment size. What makes this possible is the notion of a section, which is a continuous length of unbranched cable. Although each section is ultimately discretized into compartments, values that can vary with position along the length of a section are specified in terms of a continuous parameter that ranges from 0 to 1 (normalized distance). In this way, section properties are discussed without regard to the number of segments used to represent it. This makes it easy to trade off between accuracy and speed, and enables convenient verification of the numerical correctness of simulations. Sections are connected together to form any kind of branched tree structure. Figure 2 illustrates how sections are used to represent biologically significant anatomical features. The top of this figure is a cartoon of a neuron with a soma that gives rise to a branched dendritic tree and an axon hillock connected to a myelinated axon. Each biologically significant component of this cell has its counterpart in one of the sections of the NEURON model, as shown in the bottom of Figure 2: the cell body (Soma), axon hillock (AH), myelinated internodes (Ii ), nodes of Ranvier (Ni ), and dendrites (Di ). Sections allow this kind of functional and anatomical parcellation of the cell to remain foremost in the mind of the person who constructs and uses a NEURON model. To accommodate requirements for numerical accuracy, NEURON represents each section by one or more segments of equal length (see Figures 3 and 4). The number of segments is specified by the parameter nseg, which can have a different value for each section. At the center of each segment is a node, the location where the internal voltage of the segment is defined. The transmembrane currents over the entire surface area of a segment are associated with its node. The nodes of adjacent segments are connected by resistors. It is crucial to realize that the location of the second-order correct voltage is not at the edge of a segment but rather at its center, that is, at its node. This is the discretization method employed by NEURON. To allow branching and injection of current at the precise ends of a section while maintaining second-order correctness, extra voltage nodes that represent compartments with zero area are defined at the section ends. It is possible to achieve second-order accuracy with sections whose end nodes have
Figure 2: (Top) Cartoon of a neuron indicating the approximate boundaries between biologically significant structures. The left-hand side of the cell body (Soma) is attached to an axon hillock (AH) that drives a myelinated axon (myelinated internodes Ii alternating with nodes of Ranvier Ni ). From the right-hand side of the cell body originates a branched dendritic tree (Di ). (Bottom) How sections would be employed in a NEURON model to represent these structures.
nonzero area compartments. However, the areas of these terminal compartments would have to be exactly half that of the internal compartments, and extra complexity would be imposed on administration of channel density at branch points. Based on the position of the nodes, NEURON calculates the values of internal model parameters such as the average diameter, axial resistance, and compartment area that are assigned to each segment. Figures 3 and 4 show how an unbranched portion of a neuron, called a neurite (see Figure 3A), is represented by a section with one or more segments. Morphometric analysis generates a series of diameter measurements whose centers lie on the midline of the neurite (the thin axial line in Figure 3B). These measurements and the path lengths between their centers are the dimensions of the section, which can be regarded as a chain of truncated cones or frusta (see Figure 3C). Distance along the length of a section is discussed in terms of the normalized position parameter x. That is, one end of the section corresponds to
Figure 3: (A) Cartoon of an unbranched neurite (thick lines) that is to be represented by a section in a NEURON model. Computer-assisted morphometry generates a file that stores successive measurements of diameter (thin circles) centered at x, y, z coordinates (thin crosses). (B) Each adjacent pair of diameter measurements (thick circles) becomes the parallel faces of a truncated cone or frustum, the height of which is the distance between the measurement locations. The outline of each frustum is shown with thin lines, and a thin centerline passes through the central axis of the chain of solids. (C) The centerline has been straightened so the faces of adjacent frusta are flush with each other. The scale underneath the figure shows the distance along the midline of the section in terms of the normalized position parameter x. The vertical dashed line at x = 0.5 divides the section into two halves of equal length. (D) Electrical equivalent circuit of the section as represented by a single segment (nseg = 1). The open rectangle includes all mechanisms for ionic (noncapacitive) transmembrane currents.
x = 0 and the other end to x = 1. In Figure 3C these locations are depicted as being on the left- and right-hand ends of the section. The locations of the nodes and the boundaries between segments are conveniently specified
Figure 4: How the neurite of Figure 3 would be represented by a section with two segments (nseg = 2). Now the electrical equivalent circuit (bottom) has two nodes. The membrane properties attached to the first and second nodes are based on neurite dimensions and biophysical parameters over the x intervals [0, 0.5] and [0.5, 1], respectively. The three axial resistances are computed from the cytoplasmic resistivity and neurite dimensions over the x intervals [0, 0.25], [0.25, 0.75], and [0.75, 1].
in terms of this normalized position parameter. In general, a section has nseg segments that are demarcated by evenly spaced boundaries at intervals of 1/nseg. The nodes at the centers of these segments are located at x = (2i − 1)/(2 nseg), where i is an integer in the range [1, nseg]. As we shall see later, x is also used in specifying model parameters or retrieving state variables that are a function of position along a section (see section 4.4). The special importance of x and nseg lies in the fact that they free the user from having to keep track of the correspondence between segment number and position on the nerve. In early versions of NEURON, all nerve properties were stored in vector variables where the vector index was the segment number. Changing the number of segments was an error-prone and laborious process that demanded a remapping of the relationship between the user's mental image of the biologically important features of the model, on the one hand, and the implementation of this model in a digital computer, on the other. The use of x and nseg insulates the user from the most inconvenient aspects of such low-level details.
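The node placement rule is easy to verify from the interpreter. The following is a minimal sketch (our own check, not one of the article's listings; the variable name n_demo is ours):

n_demo = 5  // node centers for a five-segment section
for i = 1, n_demo {
    print (2*i - 1) / (2*n_demo)  // prints 0.1, 0.3, 0.5, 0.7, 0.9
}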
When nseg = 1 the entire section is lumped into a single compartment. This compartment has only one node, which is located midway along its length, at x = 0.5 (see Figures 3C and D). The integral of the surface area over the entire length of the section (0 ≤ x ≤ 1) is used to calculate the membrane properties associated with this node. The values of the axial resistors are determined by integrating the cytoplasmic resistivity along the paths from the ends of the section to its midpoint (dashed line in Figure 3C). The left- and right-hand axial resistances of Figure 3D are evaluated over the x intervals [0, 0.5] and [0.5, 1], respectively.

Figure 4 shows what happens when nseg = 2. Now NEURON breaks the section into two segments of equal length that correspond to x intervals [0, 0.5] and [0.5, 1]. The membrane properties over these intervals are attached to the nodes at 0.25 and 0.75, respectively. The three axial resistors Ri1, Ri2, and Ri3 are determined by integrating the path resistance over the x intervals [0, 0.25], [0.25, 0.75], and [0.75, 1].

3.3 Integration Methods. Spatial discretization reduced the cable equation, a partial differential equation with derivatives in space and time, to a set of ordinary differential equations with first-order derivatives in time. Selection of an integration method to solve these equations is guided by concerns of stability, accuracy, and efficiency (Hines & Carnevale, 1995). NEURON offers the user a choice of two stable implicit integration methods: backward Euler and a variant of Crank-Nicholson (C-N).

Backward Euler is the default because of its robust numerical stability properties. Backward Euler can be used with extremely large time steps in order to find the steady-state solution for a linear ("passive") system. It produces good qualitative results even with large time steps, and it works even if some or all of the equations are strictly algebraic relations among states. A more accurate method for small time steps is available by setting the global parameter secondorder to 2. NEURON then uses a variant of the C-N method, in which numerical error is proportional to Δt².

Both of these are implicit integration methods, in which all current balance equations must be solved simultaneously. The backward Euler algorithm does not resort to iteration to deal with nonlinearities, since its numerical error is proportional to Δt anyway. The special feature of the C-N variant is its use of a staggered time step algorithm to avoid iteration of nonlinear equations (see section 3.3.1). This converts the current balance part of the problem to one that requires only the solution of simultaneous linear equations.

Although the C-N method is formally stable, it is sometimes plagued by spurious large-amplitude oscillations (see Hines and Carnevale, 1995, Figure 7). This occurs when Δt is too large, as may occur in models that involve fast voltage clamps or have compartments coupled by very small resistances. However, C-N is safe in most situations, and it can be much more efficient than backward Euler for a given accuracy.
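As a concrete sketch (our notation applied to equation 3.2, not reproduced from NEURON's internals), backward Euler solves the simultaneous system
$$ c_j\,\frac{v_j^{n+1} - v_j^n}{\Delta t} + i_{\mathrm{ion}_j}(v^{n+1}) = \sum_k \frac{v_k^{n+1} - v_j^{n+1}}{r_{jk}} $$
for all voltages at $t + \Delta t$, with local error $O(\Delta t)$, whereas the C-N variant effectively evaluates the currents at the midpoint $t + \Delta t/2$ (e.g., using $\tfrac{1}{2}(v^n + v^{n+1})$), which raises the accuracy to $O(\Delta t^2)$ but admits the poorly damped oscillations mentioned above when $\Delta t$ is large.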
These two methods are almost identical in terms of computational cost per time step (see section 3.3.1). Since the current balance equations have the structure of a tree (there are no current loops), direct gaussian elimination is optimal for their solution (Hines, 1984). This takes exactly the same number of computer operations as would be required for an unbranched cable with the same number of compartments.

The best way to determine the method of choice for any particular problem is to compare both methods with several values of Δt to see which allows the largest Δt consistent with the desired accuracy. In performing such trials, one must remember that the stability properties of a simulation depend on the entire system that is being modeled. Because of interactions between the "biological" components and any "nonbiological" elements, such as stimulators or voltage clamps, the time constants of the entire system may be different from those of the biological components alone. A current source (perfect current clamp) does not affect stability because it does not change the time constants. Any other signal source imposes a load on the compartment to which it is attached, changing the time constants and potentially requiring use of a smaller time step to avoid numerical oscillations in the C-N method. The more closely a signal source approximates a voltage source (perfect voltage clamp), the greater this effect will be.
3.3.1 Efficiency. Nonlinear equations generally need to be solved iteratively to maintain second-order correctness. However, voltage-dependent membrane properties, which are typically formulated in analogy to Hodgkin-Huxley (HH) type channels, allow the cable equation to be cast in a linear form, still second order correct, that can be solved without iterations. A direct solution of the voltage equations at each time step t → t + Δt using the linearized membrane current I(V, t) = G · (V − E) is sufficient as long as the slope conductance G and the effective reversal potential E are known to second order at time t + 0.5Δt. HH type channels are easy to solve at t + 0.5Δt since the conductance is a function of state variables, which can be computed using a separate time step that is offset by 0.5Δt with respect to the voltage equation time step. That is, to integrate a state from t − 0.5Δt to t + 0.5Δt, we require only a second-order correct value for the voltage-dependent rates at the midpoint time t. Figure 5 contrasts this approach with the common technique of replacing nonlinear coefficients by their values at the beginning of a time step. For HH equations in a single compartment, the staggered time grid approach converts four simultaneous nonlinear equations at each time step to four independent linear equations that have the same order of accuracy at each time step. Since the voltage-dependent rates use the voltage at the midpoint of the integration step, integration of channel states can be done analytically in just a single addition and multiplication operation and two table-lookup operations.
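A concrete sketch of that analytic update (our formula, consistent with the description above; $m_\infty$ and $\tau_m$ are the usual HH steady-state value and time constant): integrating a gating state $m$ from $t - \Delta t/2$ to $t + \Delta t/2$ with the voltage-dependent rates frozen at $v(t)$ gives
$$ m\!\left(t + \tfrac{\Delta t}{2}\right) = m_\infty(v(t)) + \left[ m\!\left(t - \tfrac{\Delta t}{2}\right) - m_\infty(v(t)) \right] e^{-\Delta t / \tau_m(v(t))}, $$
where $m_\infty(v)$ and the factor $e^{-\Delta t/\tau_m(v)}$ are the two table lookups, leaving roughly one multiplication and one addition per state per step.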
Figure 5: The equations shown at the top of the figure are computed using the Crank-Nicholson method. (Top) x(t + Δt) and y(t + Δt) are determined using their values at time t. (Bottom) Staggered time steps yield decoupled linear equations. y(t + Δt/2) is determined using x(t), after which x(t + Δt) is determined using y(t + Δt/2).
Although this efficient scheme achieves second-order accuracy, the trade-off is that the tables depend on the value of the time step and must be recomputed whenever the time step changes.

Neuronal architecture can also be exploited to increase computational efficiency. Since neurons generally have a branched tree structure with no loops, the number of arithmetic operations required to solve the cable equation by gaussian elimination is exactly the same as for an unbranched cable with the same number of compartments. That is, we need only O(N)
arithmetic operations for the equations that describe N compartments connected in the form of a tree, even though standard gaussian elimination generally takes O(N³) operations to solve N equations in N unknowns. The tremendous efficiency increase results from the fact that in a tree, one can always find a leaf compartment i that is connected to only one other compartment j, so that the equation for compartment i (see equation 3.3a) involves only the voltages in compartments i and j, and the voltage in leaf compartment i is involved only in the equations for compartments i and j (see equations 3.3a and 3.3b):
$$ a_{ii} V_i + a_{ij} V_j = b_i \qquad (3.3a) $$
$$ a_{ji} V_i + a_{jj} V_j + [\text{terms from other compartments}] = b_j. \qquad (3.3b) $$
Using equation 3.3a to eliminate the $V_i$ term from equation 3.3b, which requires O(1) (instead of N) operations, gives equation 3.4 and leaves N − 1 equations in N − 1 unknowns:
$$ a'_{jj} V_j + [\text{terms from other compartments}] = b'_j, \qquad (3.4) $$
where $a'_{jj} = a_{jj} - a_{ij} a_{ji} / a_{ii}$ and $b'_j = b_j - b_i a_{ji} / a_{ii}$. This strategy can be applied until there is only one equation in one unknown. Assume that we know the solution to these N − 1 equations, and in particular that we know $V_j$. Then we can find $V_i$ from equation 3.3a in O(1) operations. Therefore, the effort to solve these N equations is O(1) plus the effort needed to solve N − 1 equations. The number of operations required is independent of the branching structure, so a tree of N compartments uses exactly the same number of arithmetic operations as a one-dimensional cable of N compartments.

Efficient gaussian elimination requires an ordering that can be found by a simple algorithm: choose the equation with the current minimum number of terms as the equation to use in the elimination step. This minimum degree ordering algorithm is commonly employed in standard sparse matrix solver packages. For example, NEURON's Matrix class uses the matrix library written by Stewart and Leyk (1994). This and many other sparse matrix packages are freely available on the Internet at http://www.netlib.org.
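A worked miniature (our own numbers) may help. Take a Y-shaped cell in which leaf compartments 1 and 2 are each coupled only to compartment 3:
$$ 2V_1 - V_3 = 1, \qquad 2V_2 - V_3 = 1, \qquad -V_1 - V_2 + 4V_3 = 0. $$
Eliminating leaf 1 updates only the third equation: $a'_{33} = 4 - \frac{(-1)(-1)}{2} = 3.5$ and $b'_3 = 0 - \frac{(1)(-1)}{2} = 0.5$, giving $-V_2 + 3.5\,V_3 = 0.5$. Eliminating leaf 2 the same way gives $3V_3 = 1$, hence $V_3 = 1/3$, and back-substitution yields $V_1 = V_2 = 2/3$. Each elimination and each back-substitution touched a constant number of coefficients, which is exactly the O(N) behavior described above.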
4 The NEURON Simulation Environment

No matter how powerful and robust its computational engine may be, the real utility of any software tool depends largely on its ease of use. Therefore a great deal of effort has been invested in the design of the simulation environment provided by NEURON. In this section we first briefly consider general aspects of the high-level language used for writing NEURON programs. Then we turn to an example of a model of a nerve cell to introduce specific aspects of the user environment, after which we cover these features more thoroughly.

4.1 The hoc Interpreter. NEURON incorporates a programming language based on hoc, a floating-point calculator with C-like syntax described by Kernighan and Pike (1984). This interpreter has been extended by the addition of object-oriented syntax (not including polymorphism or inheritance) that can be used to implement abstract data types and data encapsulation. Other extensions include functions that are specific to the domain of neural simulations and functions that implement a graphical user interface (see below). With hoc one can quickly write short programs that meet most problem-specific needs. The interpreter is used to execute simulations, customize the user interface, optimize parameters, analyze experimental data, calculate new variables such as impulse propagation velocity, and so forth.

NEURON simulations are not subject to the performance penalty often associated with interpreted (as opposed to compiled) languages because computationally intensive tasks are carried out by highly efficient precompiled code. Some of these tasks are related to integration of the cable equation, and others are involved in the emulation of biological mechanisms that generate and regulate chemical and electrical signals.

NEURON provides a built-in implementation of the microemacs text editor. Since the choice of a programming editor is highly personal, NEURON will also accept hoc code in the form of straight ASCII files created with any other editor.

4.2 A Specific Example. In the following example we show how NEURON might be used to model the cell in the top of Figure 6. Comments in the hoc code are preceded by double slashes (//), and code blocks are enclosed in curly brackets ({}).

4.2.1 First Step: Establish Model Topology. One very important feature of NEURON is that it allows the user to think about models in terms that are familiar to the neurophysiologist, keeping numerical issues (e.g., number of spatial segments) entirely separate from the specification of morphology and biophysical properties. This separation is achieved through the use of one-dimensional cable "sections" as the basic building block from which model cells are constructed. These sections can be connected together to
Figure 6: (Top) Cartoon of a neuron with a soma, three dendrites, and an unmyelinated axon (not to scale). The diameter of the spherical soma is 50 µm. Each dendrite is 200 µm long and tapers uniformly along its length from 10 µm diameter at its site of origin on the soma, to 3 µm at its distal end. The unmyelinated cylindrical axon is 1000 µm long and has a diameter of 1 µm. An electrode (not shown) is inserted into the soma for intracellular injection of a stimulating current. (Bottom) Topology of a NEURON model that represents this cell (see text for details).
form any kind of branched cable and endowed with properties that may vary with position along their length. The idealized neuron in Figure 6 has several anatomical features whose existence and spatial relationships we want the model to include: a cell body (soma), three dendrites, and an unmyelinated axon. The following hoc code sets up the basic topology of the model:

create soma, axon, dendrite[3]
connect axon(0), soma(0)
for i=0,2 {
    connect dendrite[i](0), soma(1)
}
The program starts by creating named sections that correspond to the important anatomical features of the cell. These sections are attached to each other using connect statements. As noted previously, each section has a normalized position parameter x, which ranges from 0 at one end to 1 at the other. Because the axon and dendrites arise from opposite sides of the cell body, they are connected to the 0 and 1 ends of the soma section (see the bottom of Figure 6).
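At this point it can be reassuring to ask NEURON to echo the tree it has built. This optional check is not part of the original listing; topology() and psection() are built-in hoc functions:

topology()         // print an ASCII schematic of how the sections connect
forall psection()  // print each section's dimensions and mechanisms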
A child section can be attached to any location on the parent, but attachment at locations other than 0 or 1 is generally employed only in special cases such as spines on dendrites.

4.2.2 Second Step: Assign Anatomical and Biophysical Properties. Next we set the anatomical and biophysical properties of each section. Each section has its own segmentation, length, and diameter parameters, so it is necessary to indicate which section is being referenced. There are several ways to declare which is the currently accessed section, but here the most convenient is to precede blocks of statements with the appropriate section name:

// specify anatomical and biophysical properties
soma {
    nseg = 1    // compartmentalization parameter
    L = 50      // [um] length
    diam = 50   // [um] diameter
    insert hh   // standard Hodgkin-Huxley currents
    gnabar_hh = 0.5*0.120  // max HH sodium conductance
}
axon {
    nseg = 20
    L = 1000
    diam = 1
    insert hh
}
for i=0,2 dendrite[i] {
    nseg = 5
    L = 200
    diam(0:1) = 10:3  // dendritic diameter tapers along its length
    insert pas        // standard passive current
    e_pas = -65       // [mV] equilibrium potential for passive current
    g_pas = 0.001     // [siemens/cm2] conductance for passive current
}
The fineness of the spatial grid is determined by the compartmentalization parameter nseg. Here the soma is lumped into a single compartment (nseg = 1), while the axon and each of the dendrites are broken into several subcompartments (nseg = 20 and 5, respectively). In this example, we specify the geometry of each section by assigning values directly to section length and diameter. This creates a stylized model. Alternatively, one can use the 3D method, in which NEURON computes
section length and diameter from a list of (x, y, z, diam) measurements (see section 4.5). Since the axon is a cylinder, the corresponding section has a fixed diameter along its entire length. The spherical soma is represented by a cylinder with the same surface area as the sphere. The dimensions and electrical properties of the soma are such that its membrane will be nearly isopotential, so the cylinder approximation is not a significant source of error. If chemical signals such as intracellular ion concentrations were important in this model, it would be necessary to approximate not only the surface area but also the volume of the soma.

Unlike the axon, the dendrites become progressively narrower with distance from the soma. Furthermore, unlike the soma, they are too long to be lumped into a single compartment with constant diameter. The taper of the dendrites is accommodated by assigning a sequence of decreasing diameters to their segments. This is done through the use of range variables, which are discussed in section 4.4.

In this model the soma and axon contain HH sodium, potassium, and leak channels (Hodgkin & Huxley, 1952), while the dendrites have constant, linear ("passive") ionic conductances. The insert statement assigns the biophysical mechanisms that govern electrical signals in each section. Particular values are set for the density of sodium channels on the soma (gnabar_hh) and for the ionic conductance and equilibrium potential of the passive current in the dendrites (g_pas and e_pas). More information about membrane mechanisms is presented in section 4.6.

4.2.3 Third Step: Attach Stimulating Electrodes. This code emulates the use of an electrode to inject a stimulating current into the soma by placing a current pulse stimulus in the middle of the soma section. The stimulus starts at t = 1 ms, lasts for 0.1 ms, and has an amplitude of 60 nA.

objref stim
soma stim = new IClamp(0.5)  // put it in middle of soma
stim.del = 1                 // [ms] delay
stim.dur = 0.1               // [ms] duration
stim.amp = 60                // [nA] amplitude
The stimulating electrode is an example of a point process. Point processes are discussed in more detail in section 4.6.

4.2.4 Fourth Step: Control Simulation Time Course. At this point all model parameters have been specified. All that remains is to define the simulation parameters, which govern the time course of the simulation, and write some code that executes the simulation. This is generally done in two procedures. The first procedure initializes the membrane potential and the states of the inserted mechanisms (channel states, ionic concentrations, extracellular potential next to the membrane).
The second procedure repeatedly calls the built-in single-step integration function fadvance() and saves, plots, or computes functions of the desired output variables at each step. In this procedure it is possible to change the values of model parameters during a run.

dt = 0.05         // [ms] integration time step
tstop = 5         // [ms]
finitialize(-65)  // initialize membrane potential, state variables, and time

proc integrate() {
    print t, soma.v(0.5)  // show starting time and initial somatic membrane potential
    while (t < tstop) {
        fadvance()  // advance solution by dt
        // function calls to save or plot results would go here
        print t, soma.v(0.5)  // show present time and somatic membrane potential
        // statements that change model parameters would go here
    }
}
The built-in function finitialize() initializes time t to 0, membrane potential v to −65 mV throughout the model, and the HH state variables m, n, and h to their steady-state values at v = −65 mV. Initialization can also be performed with a user-written routine if there are special requirements that finitialize() cannot accommodate, such as nonuniform membrane potential.

Both the integration time step dt and the solution time t are global variables. For this example dt = 50 µs. The while(){} statement repeatedly calls fadvance(), which integrates the model equations over the interval dt and increments t by dt on each call. For this example, the time and somatic membrane potential are displayed at each step. This loop exits when t ≥ tstop.

When this program is first processed by the NEURON interpreter, the model is set up and initialized, but the integrate() procedure is not executed. When the user enters an integrate() statement in the NEURON interpreter window, the simulation advances for 5 ms using 50 µs time steps.
4.3 Section Variables. Three parameters apply to the section as a whole: cytoplasmic resistivity Ra (Ω cm), the section length L, and the compartmentalization parameter nseg. The first two are ordinary in the sense that they do not affect the structure of the equations that describe the model. Note that the hoc code specifies values for L but not for Ra. This is because each section in a model is likely to have a different length, whereas the cytoplasm (and therefore Ra) is usually assumed to be uniform throughout the cell. The default value of Ra is 35.4 Ω cm, which is appropriate for invertebrate neurons. Like L it can be assigned a new value in any or all sections (e.g., ~200 Ω cm for mammalian neurons).

The user can change the compartmentalization parameter nseg without having to modify any of the statements that set anatomical or biophysical properties. However, if parameters vary with position in a section, care must be taken to ensure that the model incorporates the spatial detail inherent in the parameter description.
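For instance, a mammalian value of Ra can be applied to every section with a single forall statement (a minimal sketch, not part of the original listing):

forall Ra = 200  // [ohm cm] cytoplasmic resistivity, set in all sections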
4.4 Range Variables. Like dendritic diameter in our example, most cellular properties are functions of the position parameter x. NEURON has special provisions for dealing with these properties, which are called range variables. Other examples of range variables are the membrane potential v and ionic conductance parameters such as the maximum HH sodium conductance gnabar_hh (siemens/cm²). Range variables enable the user to separate property specification from segment number.

A range variable is assigned a value in one of two ways. The simpler and more common is as a constant. For example, the statement axon.diam = 10 asserts that the diameter of the axon is uniform over its entire length. The syntax for a property that changes along the length of a section is rangevar(xmin : xmax) = e1 : e2. The four italicized symbols are expressions, with e1 and e2 being the values of the property at xmin and xmax, respectively. The position expressions must meet the constraint 0 ≤ xmin ≤ xmax ≤ 1. Linear interpolation is used to assign the values of the property at the segment centers that lie in the position range [xmin, xmax]. In this manner a continuously varying property can be approximated by a piecewise linear function. If the range variable is diameter, neither e1 nor e2 should be 0, or the corresponding axial resistance will be infinite. In our model neuron, the simple dendritic taper is specified by diam(0:1) = 10:3 and nseg = 5. This results in five segments that have centers at x = 0.1, 0.3, 0.5, 0.7, and 0.9 and diameters of 9.3, 7.9, 6.5, 5.1, and 3.7, respectively.

The value of a range variable at the center of a segment can appear in any expression using the syntax rangevar(x) in which 0 ≤ x ≤ 1. The value returned is the value at the center of the segment containing x, not the linear interpolation of the values stored at the centers of adjacent segments. If the parentheses are omitted, the position defaults to a value of 0.5 (middle of the section).

A special form of the for statement is available: for (var) stmt. For each value of the normalized position parameter x that defines the center
of each segment in the selected section (along with positions 0 and 1), this statement assigns var that value and executes the stmt. This hoc code would print the membrane potential as a function of physical position (in µm) along the axon:

axon for (x) print x*L, v(x)
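The segment-center rule can be seen directly by querying the tapered dendrite defined earlier (our own check, not one of the original listings):

dendrite[0] {
    print diam(0.1)   // 9.3, the value at the first node
    print diam(0.15)  // also 9.3: 0.15 lies in the same segment as 0.1
    print diam(0.3)   // 7.9, the value at the second node
}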
4.5 Specifying Geometry: Stylized versus 3D. There are two ways to specify section geometry. Our example uses the stylized method, which simply assigns values to section length and diameter. This is most appropriate when cable length and diameter are authoritative and 3D shape is irrelevant. If the model is based on anatomical reconstruction data (quantitative morphometry) or if 3D visualization is paramount, it is better to use the 3D method. This approach keeps the anatomical data in a list of (x, y, z, diam) "points." The first point is associated with the end of the section that is connected to the parent (this is not necessarily the zero end), and the last point is associated with the opposite end. There must be at least two points per section, and they should be ordered in terms of monotonically increasing arc length. This pt3d list, which is the authoritative definition of the shape of the section, automatically determines the length and diameter of the section.

When the pt3d list is nonempty, the shape model used for a section is a sequence of frusta. The pt3d points define the locations and diameters of the ends of these frusta. The effective area, diameter, and resistance of each segment are computed from this sequence of points by trapezoidal integration along the segment length. This takes into account the extra area introduced by diameter changes; even degenerate cones of 0 length can be specified (i.e., two points with same coordinates but different diameters), which add area but not length to the section. No attempt is made to deal with the effects of centroid curvature on surface area. The number of 3D points used to describe a shape has nothing to do with nseg and does not affect simulation speed.

4.6 Density Mechanisms and Point Processes. The insert statement assigns biophysical mechanisms, which govern electrical and (if present) chemical signals, to a section. Many sources of electrical and chemical signals are distributed over the membrane of the cell. These density mechanisms are described in terms of current per unit area and conductance per unit area; examples include voltage-gated ion channels such as the HH currents. However, density mechanisms are not the most appropriate representation of all signal sources. Synapses and electrodes are best described in terms of localized absolute current in nanoamperes and conductance in microsiemens. These are called point processes.
An object syntax is used to manage the creation, insertion, attributes, and destruction of point processes. For example, a current clamp (electrode for injecting a current) is created by declaring an object variable and assigning it a new instance of the IClamp object class. When a point process is no longer referenced by any object variable, the point process is removed from the section and destroyed. In our example, redeclaring stim with the statement objref stim would destroy the pulse stimulus, since no other object variable is referencing it.

The location of a point process can be changed with no effect on its other attributes. In our example, the statement dendrite[2] stim.loc(1) would move the current stimulus to the distal end of the third dendrite.

Many user-defined density mechanisms and point processes can be simultaneously present in each compartment of a neuron. One important difference between density mechanisms and point processes is that any number of the same kind of point process can exist at the same location.

User-defined density mechanisms and point processes can be linked into NEURON using the model description language NMODL. This lets the user focus on specifying the equations for a channel or ionic process without regard to its interactions with other mechanisms. The NMODL translator then constructs the appropriate C program, which is compiled and becomes available for use in NEURON. This program properly and efficiently computes the total current of each ionic species used, as well as the effect of that current on ionic concentration, reversal potential, and membrane potential. An extensive discussion of NMODL is beyond the scope of this article, but its major advantages can be listed succinctly.

1. Interface details to NEURON are handled automatically, and there are a great many such details. NEURON needs to know that model states are range variables and which model parameters can be assigned values and evaluated from the interpreter. Point processes need to be accessible via the interpreter object syntax, and density mechanisms need to be added to a section when the "insert" statement is executed. If two or more channels use the same ion at the same place, the individual current contributions need to be added together to calculate a total ionic current.

2. Consistency of units is ensured.

3. Mechanisms described by kinetic schemes are written with a syntax in which the reactions are clearly apparent. The translator provides tremendous leverage by generating a large block of C code that calculates the analytic Jacobian and the state fluxes.

4. There is often a great increase in clarity since statements are at the model level instead of the C programming level and are independent of the numerical method. For example, sets of differential and
User-defined density mechanisms and point processes can be linked into NEURON using the model description language NMODL. This lets the user focus on specifying the equations for a channel or ionic process without regard to its interactions with other mechanisms. The NMODL translator then constructs the appropriate C program, which is compiled and becomes available for use in NEURON. This program properly and efficiently computes the total current of each ionic species used, as well as the effect of that current on ionic concentration, reversal potential, and membrane potential. An extensive discussion of NMODL is beyond the scope of this article, but its major advantages can be listed succinctly.

1. Interface details to NEURON are handled automatically, and there are a great many such details. NEURON needs to know that model states are range variables and which model parameters can be assigned values and evaluated from the interpreter. Point processes need to be accessible via the interpreter object syntax, and density mechanisms need to be added to a section when the "insert" statement is executed. If two or more channels use the same ion at the same place, the individual current contributions need to be added together to calculate a total ionic current.

2. Consistency of units is ensured.

3. Mechanisms described by kinetic schemes are written with a syntax in which the reactions are clearly apparent (a fragment is sketched after this list). The translator provides tremendous leverage by generating a large block of C code that calculates the analytic Jacobian and the state fluxes.

4. There is often a great increase in clarity since statements are at the model level instead of the C programming level and are independent of the numerical method. For example, sets of differential and nonlinear simultaneous equations are written using an expression syntax such as

x' = f(x, y, t)
~ g(x, y) = h(x, y)
where the prime refers to the derivative with respect to time (multiple primes such as x'' refer to higher derivatives), and the tilde introduces an algebraic equation. The algebraic portion of such systems of equations is solved by Newton's method, and a variety of methods are available for solving the differential equations, such as Runge-Kutta or backward Euler.

5. Function tables can be generated automatically for efficient computation of complicated expressions.

6. Default initialization behavior of a channel can be specified.
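As a hedged illustration of point 3 (this is only a fragment, not a complete mechanism, and the state and rate names are invented), a reversible two-state scheme might be written in NMODL as:

PARAMETER {
    alpha = 1 (/ms)    : forward rate, illustrative value
    beta = 0.5 (/ms)   : backward rate, illustrative value
}
STATE { C O }          : closed and open states
BREAKPOINT {
    SOLVE scheme METHOD sparse
}
KINETIC scheme {
    ~ C <-> O (alpha, beta)
}

From the single reaction statement, the translator generates the C code for the corresponding state fluxes and Jacobian.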
4.7 Graphical Interface.

The user is not limited to operating within the traditional code-based command-mode environment. Among its many extensions to hoc, NEURON includes functions for implementing a fully graphical, windowed interface. Through this interface, and without having to write any code at all, the user can effortlessly create and arrange displays of menus, parameter value editors, graphs of parameters and state variables, and views of the model neuron. Anatomical views, called space plots, can be explored, revealing what mechanisms and point processes are present and where they are located.

The purpose of NEURON's graphical interface is to promote a match between what the user thinks is inside the computer and what is actually there. These visualization enhancements are a major aid to maintaining conceptual control over the simulation because they provide immediate answers to questions about what is being represented in the computer.

The interface has no provision for constructing neuronal topology, a conscious design choice based on the strong likelihood that a graphical toolbox for building neuronal topologies would find little use. Small models with simple topology are so easily created in hoc that a graphical topology editor is unnecessary. More complex models are too cumbersome to deal with using a graphical editor. It is best to express the topological specifications of complex stereotyped models through algorithms, written in hoc, that generate the topology automatically. Biologically realistic models often involve hundreds or thousands of sections, whose dimensions and interconnections are contained in large data tables generated by hours of painstaking quantitative morphometry. These tables are commonly read by hoc procedures that in turn create and connect the required sections without operator intervention.

The basic features of the graphical interface and how to use it to monitor and control simulations are discussed elsewhere (Moore & Hines, 1996).
However, several sophisticated analysis and simulation tools that have special utility for nerve simulation are worthy of mention:

• The Function Fitter optimizes a parameterized mathematical expression to minimize the least squared difference between the expression and data.

• The Run Fitter allows one to optimize several parameters of a complete neuron model to experimental data. This is most useful in the context of voltage clamp data that are contaminated by incomplete space clamp, or models that cannot be expressed in closed form, such as kinetic schemes for channel conductance.

• The Electrotonic Workbench plots small signal input and transfer impedance and voltage attenuation as functions of space and frequency (Carnevale, Tsai, & Hines, 1996). These plots include the neuromorphic (Carnevale, Tsai, Claiborne, & Brown, 1995) and L versus x (O'Boyle, Carnevale, Claiborne, & Brown, 1996) renderings of the electrotonic transformation (Brown et al., 1992; Tsai, Carnevale, Claiborne, & Brown, 1994; Zador, Agmon-Snir, & Segev, 1995). By revealing the effectiveness of signal transfer, the Workbench quickly provides insight into the "functional shape" of a neuron.

All interactions with these and other tools take place in the graphical interface, and no interpreter programming is needed to use them. However, they are constructed entirely within the interpreter and can be modified when special needs require.

4.8 Object-Oriented Syntax.

4.8.1 Neurons.

It is often convenient to deal with groups of sections that are related. Therefore NEURON provides a data class called a SectionList that can be used to identify subsets of sections. Section lists fit nicely with the "regular expression" method of selecting sections, used in earlier implementations of NEURON, in that a section list is easily constructed by using regular expressions to add and delete sections. After the list is constructed it is available for reuse, and it is much more efficient to loop over the sections in a section list than to pick out the sections accepted by a combination of regular expressions. This code

objref alldend
alldend = new SectionList()
forsec "dend" alldend.append()
forsec alldend print secname()
forms a list of all the sections whose names contain the string “dend” and then iterates over the list, printing the name of each section in it. For the example program presented here, this would generate the following output
in the NEURON interpreter window:

dendrite[0]
dendrite[1]
dendrite[2]
although in this very simple example it would clearly have been easy enough to loop over the array of dendrites directly, for example:

for i = 0,2 {
    dendrite[i] print secname()
}
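A section list built this way can also be used to operate on all of its members at once. As a hedged sketch (the conductance and reversal potential values are illustrative), one might give every dendrite a passive leak with a fragment like:

forsec alldend {
    insert pas        // NEURON's built-in passive channel
    g_pas = 0.001     // specific leak conductance (S/cm2)
    e_pas = -65       // leak reversal potential (mV)
}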
4.8.2 Networks.

To help the user manage very large simulations, the interpreter syntax has been extended to facilitate the construction of hierarchical objects. This is illustrated by the following code fragment, which specifies a pattern for a simple stylized neuron consisting of three dendrites connected to one end of a soma and an axon connected to the other end:

begintemplate Cell1
    public soma, dendrite, axon
    create soma, dendrite[3], axon
    proc init() {
        for i=0,2 connect dendrite[i](0), soma(0)
        connect axon(0), soma(1)
        axon insert hh
    }
endtemplate Cell1
Whenever a new instance of this pattern is created, the init() procedure automatically connects the soma, dendrite, and axon sections together. A complete pattern would also specify default membrane properties, as well as the number of segments for each section.

Names that can be referenced outside the pattern are listed in the public statement. In this case, since init is not in the list, the user could not reinitialize by calling the init() procedure. Public names are referenced through a dot notation.

The particular benefit of using templates ("classes" in standard object-oriented terminology) is the fact that they can be employed to create any number of instances of a pattern. For example,

objref cell[10][10]
for i=0,9 for j=0,9 cell[i][j] = new Cell1()
creates an array of one hundred objects of type Cell1 that can be referenced individually by the object variable cell. In this example, cell[4][5].axon.gnabar_hh(0.5) is the value of the maximum HH sodium conductance in the middle of the axon of cell[4][5].
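The same dot notation reaches any public name of any instance; as a hedged sketch (the dimensions are invented), one could set the soma geometry of every cell in the array:

for i=0,9 for j=0,9 {
    cell[i][j].soma.L = 20      // soma length (um)
    cell[i][j].soma.diam = 20   // soma diameter (um)
}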
As this example implies, templates offer a natural syntax for the creation of networks. However, it is entirely up to the user to organize the templates logically so that they appropriately reflect the structure of the problem. Generally, any given structural organization can be viewed as a hierarchy of container classes, such as cells, microcircuits, layers, or networks. The important issue is how much effort is required for the concrete network representation to support a range of logical views of the same abstract network. A logical view that organizes the cells differently may not be easy to compute if the network is built as an elaborate hierarchy. This kind of pressure tends to encourage relatively flat organizations that make it easier to implement functions that search for specific information. The bottom line is that network simulation design remains an ad hoc process that requires careful programming judgment.

One very important class of logical views that is not generally organizable as a hierarchy is that of synaptic organization. In connecting cells with synapses, one is often driven to deal with general graphs, which is to say no structure at all. In addition to the notions of classes and objects (a synapse is an object with a pre- and a postsynaptic logical connection), the interpreter offers one other fundamental language feature that can be useful in dealing with objects that are collections of other objects: the notion of iterators, taken from the Sather programming language (Murer, Omohundro, Stoutamire, & Szyperski, 1996). This is a separation of the process of iteration from that of "what is to be done for each item." If a programmer implements one or more iterators in a collection class, the user of the class does not need to know how the class indexes its items. Instead the class will return each item in turn for execution in the context of the loop body. This allows the user to write

for layer1.synapses(syn, type) {
    // statements that manipulate the object reference named "syn"
    // (the iterator causes "syn" to refer, in turn, to each synapse
    // of a certain type in the layer1 object)
}
without being aware of the possibly complicated process of picking out these synapses from the layer (that is the responsibility of the author of the class of which layer1 is an instance).

Sadly, these kinds of language features, though very useful, do not impose any policy with regard to the design decisions that users must make in building their networks. Different programmers express very different designs on the same language base, with the consequence that it is more often than not infeasible to reconcile slightly different representations of even very similar concepts.
An example of a useful way to deal uniformly with the issue of synaptic connectivity is the policy implemented in NEURON by Lytton (1996). This implementation uses the normal NMODL methodology to define a synaptic conductance model and encloses it within a framework that manages network connectivity.

5 Summary

The recent striking expansion in the use of simulation tools in the field of neuroscience has been encouraged by the rapid growth of quantitative observations, which both stimulate and constrain the formulation of new hypotheses of neuronal function, and enabled by the availability of ever-increasing computational power at low cost. These factors have motivated the design and implementation of NEURON, the goal of which is to provide a powerful and flexible environment for simulations of individual neurons and networks of neurons. NEURON has special features that accommodate the complex geometry and nonlinearities of biologically realistic models, without interfering with its ability to handle more speculative models that involve a high degree of abstraction.

As we note in this article, one particularly advantageous feature is that the user can specify the physical properties of a cell without regard for the strictly computational concern of how many compartments are employed to represent each of the cable sections. In a future publication, we will examine how the NMODL translator is used to define new membrane channels and calculate ionic concentration changes. Another will describe the Vector class. In addition to providing very efficient implementations of frequently needed operations on lists of numbers, the Vector class offers a great deal of programming leverage, especially in the management of network models.

NEURON source code, executables, and documents are available at http://neuron.duke.edu and http://www.neuron.yale.edu, and by ftp from ftp.neuron.yale.edu.

Acknowledgments

We thank John Moore, Zach Mainen, Bill Lytton, David Jaffe, and the many other users of NEURON for their encouragement, helpful suggestions, and other contributions. This work was supported by NIH grant NS 11613 (Computer Methods for Physiological Problems) to M.L.H. and by the Yale Neuroengineering and Neuroscience Center.

References

Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Nat. Acad. Sci., 88, 11569–11573.
Brown, T. H., Zador, A. M., Mainen, Z. F., & Claiborne, B. J. (1992). Hebbian computations in hippocampal dendrites and spines. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single Neuron Computation (pp. 81–116). San Diego: Academic Press.
Carnevale, N. T., & Rosenthal, S. (1992). Kinetics of diffusion in a spherical cell: I. No solute buffering. J. Neurosci. Meth., 41, 205–216.
Carnevale, N. T., Tsai, K. Y., Claiborne, B. J., & Brown, T. H. (1995). The electrotonic transformation: A tool for relating neuronal form to function. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems (vol. 7, pp. 69–76). Cambridge, MA: MIT Press.
Carnevale, N. T., Tsai, K. Y., & Hines, M. L. (1996). The Electrotonic Workbench. Society for Neuroscience Abstracts, 22, 1741.
Cauller, L. J., & Connors, B. W. (1992). Functions of very distal dendrites: Experimental and computational studies of layer I synapses on neocortical pyramidal cells. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single Neuron Computation (pp. 199–229). San Diego: Academic Press.
Destexhe, A., Babloyantz, A., & Sejnowski, T. J. (1993). Ionic mechanisms for intrinsic slow oscillations in thalamic relay neurons. Biophys. J., 65, 1538–1552.
Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1994). A model of spindle rhythmicity in the isolated thalamic reticular nucleus. J. Neurophysiol., 72, 803–818.
Destexhe, A., Contreras, D., Steriade, M., Sejnowski, T. J., & Huguenard, J. R. (1996). In vivo, in vitro and computational analysis of dendritic calcium currents in thalamic reticular neurons. J. Neurosci., 16, 169–185.
Destexhe, A., McCormick, D. A., & Sejnowski, T. J. (1993). A model for 8–10 Hz spindling in interconnected thalamic relay and reticularis neurons. Biophys. J., 65, 2474–2478.
Destexhe, A., & Sejnowski, T. J. (1995). G-protein activation kinetics and spillover of GABA may account for differences between inhibitory responses in the hippocampus and thalamus. Proc. Nat. Acad. Sci., 92, 9515–9519.
Häusser, M., Stuart, G., Racca, C., & Sakmann, B. (1995). Axonal initiation and active dendritic propagation of action potentials in substantia nigra neurons. Neuron, 15, 637–647.
Hines, M. (1984). Efficient computation of branched nerve equations. Int. J. Bio-Med. Comput., 15, 69–76.
Hines, M. (1989). A program for simulation of nerve equations with branching geometries. Int. J. Bio-Med. Comput., 24, 55–68.
Hines, M. (1993). NEURON—a program for simulation of nerve equations. In F. Eeckman (Ed.), Neural Systems: Analysis and Modeling (pp. 127–136). Norwell, MA: Kluwer.
Hines, M. (1994). The NEURON simulation program. In J. Skrzypek (Ed.), Neural Network Simulation Environments (pp. 147–163). Norwell, MA: Kluwer.
Hines, M., & Carnevale, N. T. (1995). Computer modeling methods for neurons. In M. A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks (pp. 226–230). Cambridge, MA: MIT Press.
Hines, M., & Shrager, P. (1991). A computational test of the requirements for conduction in demyelinated axons. J. Restor. Neurol. Neurosci., 3, 81–93.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Hsu, H., Huang, E., Yang, X.-C., Karschin, A., Labarca, C., Figl, A., Ho, B., Davidson, N., & Lester, H. A. (1993). Slow and incomplete inactivations of voltage-gated channels dominate encoding in synthetic neurons. Biophys. J., 65, 1196–1206.
Jaffe, D. B., Ross, W. N., Lisman, J. E., Miyakawa, H., Lasser-Ross, N., & Johnston, D. (1994). A model of dendritic Ca2+ accumulation in hippocampal pyramidal neurons based on fluorescence imaging experiments. J. Neurophysiol., 71, 1065–1077.
Kernighan, B. W., & Pike, R. (1984). Appendix 2: Hoc manual. In B. W. Kernighan & R. Pike (Eds.), The UNIX Programming Environment (pp. 329–333). Englewood Cliffs, NJ: Prentice Hall.
Lindgren, C. A., & Moore, J. W. (1989). Identification of ionic currents at presynaptic nerve endings of the lizard. J. Physiol., 414, 210–222.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Computation, 8, 501–509.
Lytton, W. W., Destexhe, A., & Sejnowski, T. J. (1996). Control of slow oscillations in the thalamocortical neuron: A computer model. Neurosci., 70, 673–684.
Lytton, W. W., & Sejnowski, T. J. (1992). Computer model of ethosuximide's effect on a thalamic neuron. Ann. Neurol., 32, 131–139.
Mainen, Z. F., Joerges, J., Huguenard, J., & Sejnowski, T. J. (1995). A model of spike initiation in neocortical pyramidal neurons. Neuron, 15, 1427–1439.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mascagni, M. V. (1989). Numerical methods for neuronal modeling. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling (pp. 439–484). Cambridge, MA: MIT Press.
Moore, J. W., & Hines, M. (1996). Simulations with NEURON 3.1, on-line documentation in HTML format, available at http://neuron.duke.edu.
Murer, S., Omohundro, S. M., Stoutamire, D., & Szyperski, C. (1996). Iteration abstraction in Sather. ACM Transactions on Programming Languages and Systems, 18, 1–15.
O'Boyle, M. P., Carnevale, N. T., Claiborne, B. J., & Brown, T. H. (1996). A new graphical approach for visualizing the relationship between anatomical and electrotonic structure. In J. M. Bower (Ed.), Computational Neuroscience: Trends in Research 1995. San Diego: Academic Press.
Rall, W. (1964). Theoretical significance of dendritic tree for input-output relation. In R. F. Reiss (Ed.), Neural Theory and Modeling (pp. 73–97). Stanford: Stanford University Press.
Rall, W. (1989). Cable theory for dendritic neurons. In C. Koch & I. Segev (Eds.), Methods in Neuronal Modeling (pp. 8–62). Cambridge, MA: MIT Press.
Softky, W. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neurosci., 58, 13–41.
Stewart, D., & Leyk, Z. (1994). Meschach: Matrix computations in C. In Proceedings of the Centre for Mathematics and Its Applications (Vol. 32). Canberra, Australia: School of Mathematical Sciences, Australian National University.
Traynelis, S. F., Silver, R. A., & Cull-Candy, S. G. (1993). Estimated conductance of glutamate receptor channels activated during EPSCs at the cerebellar mossy fiber-granule cell synapse. Neuron, 11, 279–289.
Tsai, K. Y., Carnevale, N. T., & Brown, T. H. (1994). Hebbian learning is jointly controlled by electrotonic and input structure. Network, 5, 1–19.
Tsai, K. Y., Carnevale, N. T., Claiborne, B. J., & Brown, T. H. (1994). Efficient mapping from neuroanatomical to electrotonic space. Network, 5, 21–46.
Zador, A. M., Agmon-Snir, H., & Segev, I. (1995). The morphoelectrotonic transform: A graphical approach to dendritic function. J. Neurosci., 15, 1669–1682.
Received September 26, 1996; accepted January 10, 1997.
Communicated by Jürgen Schmidhuber
On Bias Plus Variance

David H. Wolpert
IBM Almaden Research Center, San Jose, CA 95120, U.S.A.
This article presents several additive corrections to the conventional quadratic loss bias-plus-variance formula. One of these corrections is appropriate when both the target is not fixed (as in Bayesian analysis) and training sets are averaged over (as in the conventional bias-plus-variance formula). Another additive correction casts conventional fixed-training-set Bayesian analysis directly in terms of bias plus variance. Another correction is appropriate for measuring full generalization error over a test set rather than (as with conventional bias plus variance) error at a single point. Yet another correction can help explain the recent counterintuitive bias-variance decomposition of Friedman for zero-one loss. After presenting these corrections, this article discusses some other loss-function-specific aspects of supervised learning. In particular, there is a discussion of the fact that if the loss function is a metric (e.g., zero-one loss), then there is a bound on the change in generalization error accompanying changing the algorithm's guess from h1 to h2, a bound that depends only on h1 and h2 and not on the target. This article ends by presenting versions of the bias-plus-variance formula appropriate for logarithmic and quadratic scoring, and then all the additive corrections appropriate to those formulas. All the correction terms presented are a covariance, between the learning algorithm and the posterior distribution over targets. Accordingly, in the (very common) contexts in which those terms apply, there is not a "bias-variance trade-off" or a "bias-variance dilemma," as one often hears. Rather there is a bias-variance-covariance trade-off.
1 Introduction

The bias-plus-variance formula (Geman et al., 1992) is an extremely powerful tool for analyzing supervised learning scenarios that have quadratic loss functions, fixed targets (i.e., "truths"), and averages over the training sets. Indeed, there is little doubt that it is the most frequently cited formula in the statistical literature for analyzing such scenarios.

Despite this breadth of utility in such scenarios, the bias-plus-variance formula has never been extended to other learning scenarios. In this article an additive correction to the formula is presented, appropriate for learning scenarios where the target is not fixed. The associated
formula for expected loss constitutes a midway point between Bayesian analysis and conventional bias-plus-variance analysis, in that both targets and training sets are averaged over.

After presenting this correction, other correction terms are presented, appropriate for when other sets of random variables are averaged over. In particular, the correction term to the bias-plus-variance formula for when the test set point is not fixed (as it is not in almost all of computational learning theory, as well as most other investigations of "generalization error") is presented. (The conventional bias-plus-variance formula has the test point fixed.) In addition, it is shown how to cast expected loss in conventional Bayesian analysis (where the training set is fixed but targets are averaged over) directly in terms of bias plus variance. All of this emphasizes that the conventional bias-plus-variance decomposition is only a very specialized case of a much more general and important phenomenon.

Next is a brief discussion of some other loss-function-specific properties of supervised learning. In particular, it is shown how with quadratic loss there is a scheme that assuredly, independent of the target, improves the performance of any learning algorithm with a random component. On the other hand, using the same scheme for concave loss functions results in assured degradation of performance. It is also shown that, without any concern for the target, one can bound the change in zero-one loss generalization error associated with making some guess h1 rather than a different guess h2. (This is not possible for quadratic loss.)

All of the extensions mentioned above to the conventional version of the bias-plus-variance formula use the same quadratic loss function occurring in the conventional formula itself. That loss function is often appropriate when the output space is numeric. Kong and Dietterich (1995) recently proposed an extension of the conventional (fixed-target, training set-averaged) formula to the zero-one loss function. (See also Wolpert & Kohavi, 1997, and Kohavi & Wolpert, 1996.) That loss function is often appropriate when the output space is categorical rather than numeric.

For such categorical output spaces, sometimes one's algorithm produces a guessed probability distribution over the output space rather than a single guessed output value. In those scenarios, "scoring rules" are usually a more appropriate form of measuring generalization performance than are loss functions. This article ends by presenting extensions of the fixed-target version of the bias-plus-variance formula to the logarithmic and quadratic scoring rules, and then presents the associated additive corrections to those formulas. ("Scoring rules" as explored in this article are similar to what are called "score functions" in the statistics literature; Bernardo & Smith, 1994.)

All of the correction terms presented are a covariance, between the learning algorithm and the posterior distribution over targets. Accordingly, in the (very common) contexts in which they apply, there is not a "bias-variance trade-off" or a "bias-variance dilemma," as one often hears. Rather there is a bias-variance-covariance trade-off.
Section 2 presents the formalism that will be used in the rest of the article. Section 3 uses this formalism to recapitulate the traditional bias-plus-variance formula. Certain desiderata that the terms in the bias-plus-variance decomposition should meet are also presented there. Sections 4 and 5 present the corrections to this formula appropriate for quadratic loss. Section 6 discusses some other loss-function-specific aspects of supervised learning.

Recently Friedman (1996) drew attention to an important aspect of expected error for zero-one loss. His analysis appeared to indicate that under certain circumstances, when (what he identified as) the variance increased, it could result in decreased generalization error. In section 7, it is shown how to perform a bias-variance decomposition for Friedman's scenario where commonsense characteristics of bias and variance (like error being an increasing function of each) are preserved. This discussion serves as a useful illustration of the utility of covariance terms, since it is the presence of that term that explains Friedman's apparently counterintuitive results.

Section 8 begins by presenting the extensions of the bias-plus-variance formula for scoring-rule-based rather than loss-function-based error. That section then investigates logarithmic scoring. Section 9 investigates quadratic scoring. Finally, section 10 discusses future work.

2 Nomenclature

This article uses the extended Bayesian formalism (EBF) (Wolpert, 1994, 1995, 1996a, 1996b). In the current context, the EBF is just conventional probability theory, applied to the case where one has a different random variable for the hypothesis output by the learning algorithm and for the target relationship. It is this crucial extension that separates the EBF from conventional Bayesian analysis and allows the EBF (unlike conventional Bayesian analysis) to subsume all other major mathematical treatments of supervised learning, like computational learning theory and sampling theory statistics. (See Wolpert, 1994.)

This section presents a synopsis of the EBF. A quick reference for this synopsis can be found in Table 1. Readers unsure of any aspects of this synopsis, and in particular unsure of any of the formal basis of the EBF or justifications for any of its assumptions, are directed to the detailed exposition of the EBF in Appendix A of Wolpert (1996a).

2.1 Overview.

The input and output spaces are X and Y, respectively. For simplicity, they are taken to be finite. (This imposes no restrictions on the real-world utility of the results in this article, since in the real world data are always analyzed on a finite digital computer and are based on the output of instruments having a finite number of possible readings.) The two spaces contain n and r elements, respectively. A generic element of X is indicated by x, and a generic element of Y is indicated by y.
Table 1: Summary of the Terms in the EBF.

The sets X and Y, of sizes n and r            The input and output space, respectively
The set d, of m X-Y pairs                     The training set
The X-conditioned distribution over Y, f      The target, used to generate test sets
The X-conditioned distribution over Y, h      The hypothesis, used to guess for test sets
The real number c                             The cost
The X-value q                                 The test set point
The Y-value y_F                               The sample of the target f at point q
The Y-value y_H                               The sample of the hypothesis h at point q
P(h | d)                                      The learning algorithm
P(f | d)                                      The posterior
P(d | f)                                      The likelihood
P(f)                                          The prior

Note: If c = L(y_F, y_H), L(·, ·) is the "loss function." Otherwise c is given by a "scoring rule."
Sometimes (e.g., when requiring a Bayes-optimal algorithm to guess an expected Y value) it will implicitly be assumed that Y is a large set of real numbers that are very close to one another, so that there is no significant difference between the element in Y closest to some real number ψ and ψ itself. The lowercase Greek delta (δ) indicates either the Kronecker or Dirac delta function, as appropriate.

Random variables are indicated using capital letters. Associated instantiations of a random variable are indicated using lowercase letters. Note, though, that some quantities (e.g., the space X) are neither random variables nor instantiations of random variables, and therefore their written case carries no significance. Only rarely will it be necessary to refer to a random variable rather than an instantiation of it. In particular, whenever possible, the argument of a probability distribution will be taken to indicate the associated random variable (e.g., whenever possible, P(a) will be written rather than P_A(a)). In accord with standard statistics notation, E(A | b) will be used to mean the expectation value of A given B = b, that is, to mean ∫ da a P(a | b). (Sums replace integrals if appropriate.)

The primary random variables are the hypothesis X-Y relationship output by the learning algorithm (indicated by H), the target (i.e., "true") X-Y relationship (F), the training set (D), and the real-world cost (C). These variables are related to one another through other random variables representing the (test set) input space value (Q), and the associated target and hypothesis Y-values, Y_F and Y_H, respectively (with instantiations y_F and y_H, respectively).
This completes the list of random variables. Formal definitions of them appear below. As an example of the relationship between these random variables and supervised learning, f, a particular instantiation of a target, could refer to a "teacher" neural net together with superimposed noise. This noise-corrupted neural net generates the training set d. The hypothesis h, on the other hand, could be the neural net made by one's "student" algorithm after training on d. Then q would be an input element of the test set, y_F and y_H associated samples of the outputs of the two neural nets for that element (the sampling of y_F including the effects of the superimposed noise), and c the resultant "cost" (e.g., c could be (y_F − y_H)²).

2.2 Training Sets and Targets.

m is the number of elements in the (ordered) training set d. {d_X(i), d_Y(i)} is the set of m input and output values in d. m′ is the number of distinct values in d_X.

Targets f are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function f(x ∈ X, y ∈ Y) (i.e., P(y_F | f, q) = f(q, y_F)). Equivalently, where S_r is defined as the r-dimensional unit simplex, targets can be viewed as mappings f: X → S_r. Note that any such target is a finite set of real numbers indexed by an X value and a Y value.

Any restrictions on f are imposed by the full joint distribution P(f, h, d, c), and in particular by its marginalization, P(f). Note that any output noise process is automatically reflected in P(y_F | f, q). Note also that the equality P(y_F | f, q) = f(q, y_F) only directly refers to the generation of test set elements; in general, training set elements can be generated from targets in a different manner (for example, if training and testing have different noise levels).

The "likelihood" is P(d | f). It says how d was generated from f. As an example, the conventional independent identically distributed (IID) likelihood is

P(d | f) = Π_{i=1}^{m} π(d_X(i)) f(d_X(i), d_Y(i))

(where π(x) is the "sampling distribution"). In other words, under this likelihood, d is created by repeatedly and independently choosing an input value d_X(i) by sampling π(·), and then choosing an associated output value by sampling f(d_X(i), ·), the same distribution used to generate test set outputs. (None of the results in this article depend on the choice of the likelihood.)

The term "posterior" usually means P(f | d), and the term "prior" usually means P(f).
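As a toy illustration of the IID likelihood (all numbers here are invented): let X = {0, 1}, Y = {0, 1}, let the sampling distribution be uniform, π(x) = 1/2, and let d consist of the m = 2 pairs (0, 1) and (1, 1). Then

P(d | f) = π(0) f(0, 1) · π(1) f(1, 1) = (1/4) f(0, 1) f(1, 1),

which equals 1/4 for the noise-free target with f(x, 1) = 1 for all x, and 1/16 for the maximally noisy target with f(x, y) = 1/2 for all x and y.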
2.3 The Learning Algorithm.

Hypotheses h are always assumed to be of the form of X-conditioned distributions over Y, indicated by the real-valued function h(x ∈ X, y ∈ Y) (i.e., P(y_H | h, q) = h(q, y_H)). Equivalently, where S_r is defined as the r-dimensional unit simplex, hypotheses can be viewed as mappings h: X → S_r. Note that any such hypothesis is a finite set of real numbers.
Any restrictions on h are imposed by P(f, h, d, c). Here and throughout, a "single-valued" distribution is one that, for a given x, is a delta function about some y. Such a distribution is a single-valued function from X to Y. As an example, if one is using a neural net as one's regression through the training set, usually the (neural net) h is single valued. On the other hand, when one is performing probabilistic classification (as in softmax), h is not single valued.

The generalization behavior of any learning algorithm (aka "generalizer") is completely specified by P(h | d), although writing down a learning algorithm's P(h | d) explicitly is often quite difficult. A learning algorithm is "deterministic" if the same d always gives the same h. Backpropagation with a random initial weight is not deterministic. Nearest neighbor is.

The learning algorithm sees only the training set d, and in particular does not directly see the target. So P(h | f, d) = P(h | d), which means that P(h, f | d) = P(h | d) × P(f | d), and therefore P(f | h, d) = P(h, f | d)/P(h | d) = P(f | d).

By definition of f, in supervised learning, Y_F and Y_H are conditionally independent given f and q: P(y_F, y_H | f, q) = P(y_F | y_H, f, q) P(y_H | f, q) = P(y_F | f, q) P(y_H | f, q). Similarly, Y_F and Y_H are conditionally independent given d and q.

Proof. P(y_F, y_H | d, q) = P(y_F | d, q) P(y_H | d, q, y_F)
= P(y_F | d, q) ∫ dh P(y_H | h, d, q, y_F) P(h | d, q, y_F)
= P(y_F | d, q) ∫ dh P(y_H | h, q) P(h | d)
= P(y_F | d, q) ∫ dh P(y_H | h, d, q) P(h | d, q)
= P(y_F | d, q) P(y_H | d, q).

2.4 The Cost and "Generalization Error."

Given values of F, H, and a test set point q ∈ X, the associated cost or error is indicated by the random variable C. Often C is a loss function and can be expressed in terms of a mapping L taking Y × Y to a real number. Formally, in these cases the probability that C takes on the value c, conditioned on given values h, f, and q, is

P(c | f, h, q) = Σ_{y_H,y_F} P(c | y_H, y_F) P(y_H, y_F | f, h, q) = Σ_{y_H,y_F} δ{c − L(y_H, y_F)} h(q, y_H) f(q, y_F).¹

As an example, quadratic loss has L(y_H, y_F) = (y_H − y_F)², so E(C | f, h, q) = Σ_{y_H,y_F} f(q, y_F) h(q, y_H)(y_H − y_F)².

¹ Note that if L(·, ·) is a symmetric function of its arguments, this expression for P(C | f, h, q) is a non-Euclidean inner product between the (Y-indexed) vectors f(q, ·) and h(q, ·). Such inner products generically arise in response to conditional independencies among the random variables. In the case here, it arises due to the fact that P(y_H | y_F, h, f, q) = P(y_H | h, q). Similarly, in Wolpert (1995) it is shown that because P(h | d, f) = P(h | d), E(C | d) is a non-Euclidean inner product between the (H-indexed) vector P(h | d) and the (F-indexed) vector P(f | d); your expected cost is set by how "aligned" your learning algorithm P(h | d) is with the actual posterior, P(f | d).

Generically, when the distribution of c given f, h, and q cannot be reduced in this way to a loss function from Y × Y to R, it will be referred to as a scoring rule. Scoring rules are often appropriate when you are trying to guess a
distribution over Y, and loss functions are usually appropriate when you are trying to guess a particular value in Y.

As an example, for "logarithmic scoring," P(c | f, h, q) = δ{c − Σ_y f(q, y) ln[h(q, y)]}. This cost is the logarithm of the probability one would assign to an infinite data set generated according to the target f, if one had assumed (erroneously) that it was actually generated according to the hypothesis h.

The "generalization error function" used in much of supervised learning is given by c′ ≡ E(C | f, h, d). It is the average over all q of the cost c, for a given target f, hypothesis h, and training set d.

2.5 Miscellaneous.

Note the implicit rule of probability theory that any random variable not conditioned on is marginalized over. For example (using the conditional independencies in conventional supervised learning), expected cost given the target, training set size, and test set point is given by

E(C | f, m, q) = ∫ dh Σ_d E(C | h, d, f, q) P(h | d, f, q, m) P(d | f, q, m)
= ∫ dh Σ_d E(C | f, h, q) P(h | d) P(d | f, q, m)
= ∫ dh E(C | f, h, q) × {Σ_d P(h | d) P(d_Y | f, d_X) P(d_X | f, m, q)}.

(I do not equate P(d_X | f, q, m) with P(d_X | m), as is conventionally (though implicitly) done in most theoretical supervised learning, because in general the test set point may be coupled to d_X and even f. See Wolpert, 1995.)

3 Bias Plus Variance for Quadratic Loss

3.1 The Bias-Plus-Variance Formula.

This section reviews the conventional bias-plus-variance formula for quadratic loss, with a fixed target and averages over training sets. Write

E(Y_F | f, q) = Σ_y y f(q, y)
E(Y_F² | f, q) = Σ_y y² f(q, y)
E(Y_H | f, q, m) = ∫ dh Σ_d P(d | f, q, m) P(h | d) Σ_y y h(q, y)
E(Y_H² | f, q, m) = ∫ dh Σ_d P(d | f, q, m) P(h | d) Σ_y y² h(q, y),

where for succinctness the m-conditioning in the expectation values is not indicated if the expression is independent of m. These are, in order, the
average Y and Y² values of the target (at q) and the average of the hypotheses made in response to training sets generated from the target (again, evaluated at q). Note that these averages need not exist in Y in general. For example, this is almost always the case if Y is binary.

Now write C = (Y_H − Y_F)². Then simple algebra (use the conditional independence of Y_H and Y_F) verifies the following formula:

E(C | f, m, q) = σ²_{f,m,q} + (bias_{f,m,q})² + variance_{f,m,q},    (3.1)

where

σ²_{f,m,q} ≡ E(Y_F² | f, q) − [E(Y_F | f, q)]²,
bias_{f,m,q} ≡ E(Y_F | f, q) − E(Y_H | f, q, m),
variance_{f,m,q} ≡ E(Y_H² | f, q, m) − [E(Y_H | f, q, m)]².
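To see the decomposition at work on invented numbers: suppose that at the point q the target has E(Y_F | f, q) = 1 and E(Y_F² | f, q) = 1.25, while over training sets the algorithm's guesses have E(Y_H | f, q, m) = 0.5 and E(Y_H² | f, q, m) = 0.5. Then σ²_{f,m,q} = 1.25 − 1 = 0.25, bias_{f,m,q} = 1 − 0.5 = 0.5, and variance_{f,m,q} = 0.5 − 0.25 = 0.25, so equation 3.1 gives E(C | f, m, q) = 0.25 + 0.25 + 0.25 = 0.75. The same number follows directly from E(C | f, m, q) = E(Y_H² | f, q, m) − 2 E(Y_H | f, q, m) E(Y_F | f, q) + E(Y_F² | f, q) = 0.5 − 1 + 1.25 = 0.75, where the cross term factors by the conditional independence of Y_H and Y_F.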
The subscript {f, m, q} indicates the conditioning event for the expected error and will become important below. When the conditioning event is clear or not important, bias_{f,m,q} may be referred to simply as the bias, and similarly for the variance and the noise terms.

The bias-variance formula in Geman et al. (1992) is a special case of equation 3.1, where the learning algorithm always guesses the same h given the same training set d (something that is not the case for backpropagation with a random initial weight, for example). In addition, in Geman et al. (1992) the hypothesis h that the learning algorithm guesses is always a single-valued mapping from X to Y.

Note that essentially no assumptions are made in deriving equation 3.1. Any likelihood is allowed, any learning algorithm, any relationship between q and f and/or d, and so on. This will be true for all of the analysis in this article.

In addition to such generality, the utility of the bias-plus-variance formula lies in the fact that often there is a "bias-variance trade-off." For example, it may be that a modification to a learning algorithm improves its bias for the target at hand. (This often happens when more free parameters are incorporated into the learning algorithm's model, for example.) But this is often at the expense of increased variance.

In addition, the terms in the bias-plus-variance formula all involve (functions of) expectation values of the fundamental random variables described in the previous section; no new random variables are involved. This means that the bias-plus-variance formula is particularly intuitive and easy to interpret, as the discussion in the next subsection illustrates.

3.2 Desiderata Obeyed by the Terms in the Quadratic Loss Bias-Plus-Variance Formula.

To facilitate the generalization of the bias-plus-variance
formula, define z ≡ {f, m, q}, the set of values of random variables we are conditioning on in equation 3.1. Then intuitively, in equation 3.1, for the point q:

1. σ²_z measures the intrinsic error due to the target f, independent of the learning algorithm. For the current choice of z it is given by E(C | h, z)/2 for h = f; that is, it equals half the expected loss of f at "guessing itself" at the point q.

2. The bias measures the difference between the average Y_H and the average Y_F (where Y_F is formed by sampling f and Y_H is formed by sampling h's created from d's that are in turn created from f).

3. Alternatively, σ²_{f,m,q} plus the squared bias measures the expected loss between Y_F and the average Y_H, E((Y_F − [E(Y_H | z)])² | z).

4. The variance reflects the variability of the guessed y_H about the average y_H as one varies over training sets (generated according to the given fixed value of z). If the learning algorithm always guesses the same h for the same d, and that h is always a single-valued function from X to Y, then the variance is given directly by the variability of the learning algorithm's guess as d is varied.

5. It is worth pointing out one special property for when z = {f, m, q}. For that case the variance does not depend on f directly, but only indirectly through the induced distribution over training sets, P(d | f, m, q). So, for example, consider having the target changed in such a way that the resultant distribution P(d | f, m, q) over training sets does not change. (For instance, this would be the case if there were negligibly small probability that the q at hand exists in the training set, perhaps because n ≫ m.) Then the variance term does not change. This is desirable because we wish the intrinsic noise term to reflect the target alone, the bias to reflect the target's relation with the learning algorithm, and the variance term to reflect the learning algorithm alone. In particular, for the usual intuitive characterization of variance to hold, we want it to reflect how sensitive the algorithm is to changes in the data set, regardless of the target.

Note that although σ²_z appears to be identical to variance_z if one simply replaces Y_F with Y_H, the two quantities have different kinds of relations with the other random variables. For example, for z = {f, m, q}, variance_z depends on the target as well as the learning algorithm, whereas σ²_z depends only on the target.

So expected quadratic loss reflects noise in the target, plus the difference between the target and the average guess, plus the variability in the guessing.
In particular, we have the following properties, which can be viewed as desiderata for our three terms:

a. If f is a delta function in Y for the q at hand (i.e., if at the point X = q, f is a single-valued function from X to Y), the intrinsic noise term equals 0. In addition, the intrinsic noise term is independent of the learning algorithm. Finally, the intrinsic noise term is a lower bound on the error; for no learning algorithm can E(C | z) be lower than the intrinsic noise term. (In fact, for the decomposition in equation 3.1, the intrinsic noise term is the greatest lower bound on that error.)

b. If the average hypothesis-determined guess equals the average target-determined "guess," then bias_z = 0. Bias_z is large if the difference between those averages is large.

c. Variance is nonnegative, equals 0 if the guessed h is always the same single-valued function (independent of d), and is large when the guessed h varies greatly in response to changes in d.

d. The variance does not depend on z directly, but only indirectly through the induced distribution P(h | z). For any z, the associated variance is set by the h dependence of P(h | z). Now P(h | z) = Σ_d P(h | d, z) P(d | z). Moreover, it is often the case that P(h | d, z) = P(h | d). (For example, this holds for z = {f, m, q}.) In such a case, presuming one knows the learning algorithm (i.e., one knows P(h | d)), the h dependence of P(h | z) is set by the d dependence of P(d | z). This means that changes to z that do not affect the induced distribution over d do not affect the variance. As mentioned above, this is needed if the variance term is only to reflect how sensitive the algorithm is to changes in the training set.

e. All the terms in the decomposition are continuous functions of the target. This is particularly desirable when one wishes to estimate those terms from limited information concerning the target (e.g., from a finite data set). If there were discontinuities and the target were near such a discontinuity, the resultant estimates would often be poor. More generally, it would be difficult to ascribe the usual intuitive meanings to the terms in the decomposition if they were discontinuous functions of the target.

Desiderata a through e are somewhat more general than conditions 1 through 4, in that (for example) they are meaningful even if Y is a nonnumeric space, so that expressions like "E(Y_F | z)" are not defined. Accordingly, I will rely on them more than on conditions 1 through 4 in the extensions of the bias-plus-variance formula presented below.

In the final analysis though, both the conditions 1 through 4 and the desiderata a through e are not God-given principles that any bias-plus-variance decomposition must obey. Rather, they are useful aspects of the
decomposition that facilitate that decomposition's "intuitive and easy" interpretation. There is nothing that precludes one's using slight variants of these conditions, or perhaps even replacing them altogether.

The next two sections show how to generalize the bias-plus-variance formula to other conditioning events besides z = {f, m, q} while still obeying (almost all of) conditions 1 through 4 and desiderata a through c. First, in section 4 the generalization to z = {m, q} is presented. Then in section 5, the generalization to arbitrary z is explored.

4 The Midway Point Between Bayesian Analysis and Bias Plus Variance

4.1 Bias Plus Variance for Averaging over Targets: The Covariance Correction.

It is important to realize that, illustrative as it is, the bias-plus-variance formula examines the wrong quantity. In the real world, it is almost never E(C | f, m) that is directly of interest, but rather E(C | d). (We know d and therefore can fix its value in the conditioning event. We do not know f.) Analyzing E(C | d) is the purview of Bayesian analysis (Buntine & Weigend, 1991; Bernardo & Smith, 1994). Generically, it says that for quadratic loss, one should guess the posterior average y (Wolpert, 1995).

As conventionally discussed, E(C | d) does not bear any connection to the bias-plus-variance formula. However, there is a midway point between Bayesian analysis and the kind of analysis that results in the bias-plus-variance formula. In this middle approach, rather than fix f as in bias plus variance, one averages over it, as in the Bayesian approach. In this way one circumvents the annoying fact that (common lore to the contrary) there need not always be a bias-variance trade-off, in that there exists an algorithm with both zero bias_{f,m,q} and zero variance_{f,m,q}: the algorithm that always "by luck" guesses y_H = E(Y_F | f, q), independent of d. (Indeed, Breiman's (1994) bagging scheme is usually justified as a way to try to estimate that "lucky" algorithm.) Since in this middle approach f is not fixed, such a lucky guess is undefined. In addition, in this middle approach, rather than fix d as in the Bayesian approach, one averages over d, as in bias plus variance. In this way one maintains the illustrative power of the bias-plus-variance formula.

The result of adopting this middle approach is the following Bayesian correction to the quadratic loss bias-plus-variance formula (Wolpert, 1995):

E(C | m, q) = σ²_{m,q} + (bias_{m,q})² + variance_{m,q} − 2 cov_{m,q},    (4.1)

where

σ²_{m,q} ≡ E(Y_F² | q) − [E(Y_F | q)]²,
bias_{m,q} ≡ E(Y_F | q) − E(Y_H | m, q),
variance_{m,q} ≡ E(Y_H² | m, q) − [E(Y_H | m, q)]², and
cov_{m,q} ≡ Σ_{y_F,y_H} P(y_H, y_F | m, q) × [y_H − E(Y_H | m, q)] × [y_F − E(Y_F | q)].

In this equation, the terms E(Y_F | q), E(Y_F² | q), E(Y_H | q, m), and E(Y_H² | q, m) are given by the formulas just before equation 3.1, provided one adds an outer integral ∫ df P(f) to average out f. To evaluate the covariance term, use P(y_H, y_F | m, q) = ∫ dh ∫ df Σ_d P(y_H, y_F, h, d, f | m, q). Then use the simple identity

P(y_H, y_F, h, d, f | m, q) = f(q, y_F) h(q, y_H) P(h | d) P(d | f, q, m) P(f).

Formally, the reason that the covariance term exists in equation 4.1 when there was none in equation 3.1 is that y_H and y_F are conditionally independent if one is given f and q (as in equation 3.1), but not if one is given only q (as in equation 4.1). To illustrate the latter point, note that knowing y_F, for example, tells you something about f you do not already know (assuming f is not fixed, as it is in equation 3.1). This in turn tells you something about d, and therefore something about h and y_H. In this way y_H and y_F are statistically coupled if f is not fixed.

Intuitively, the covariance term simply says that one would like the learning algorithm's guess to track the (posterior) most likely targets, as one varies training sets. Without such tracking, simply having low bias_{m,q} and low variance_{m,q} does not imply good generalization. This is intuitively reasonable. Indeed, the importance of such tracking between the learning algorithm P(h | d) and the posterior P(f | d) is to be expected, given that E(C | m, q) can be written as a non-Euclidean inner product between P(f | d) and P(h | d). (This is true for any loss function; see Wolpert, 1995.)
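A toy example (all numbers invented) may help. Let there be a single test point q and two equally likely noise-free targets, one with Y_F = 0 and one with Y_F = 1, and suppose the training set always reveals which target is present, with the learning algorithm guessing the revealed value. Then E(Y_F | q) = E(Y_H | m, q) = 1/2, so bias_{m,q} = 0, while σ²_{m,q} = variance_{m,q} = 1/4. The algorithm is never wrong, and indeed cov_{m,q} = 1/4, so equation 4.1 gives E(C | m, q) = 1/4 + 0 + 1/4 − 2(1/4) = 0. Low bias and low variance alone would not have predicted this; it is the tracking captured by cov_{m,q} that drives the error to zero.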
4.2 Discussion.

The terms bias_{m,q}, variance_{m,q}, and σ_{m,q} play the same roles as bias_{f,m,q}, variance_{f,m,q}, and σ_{f,m,q} do in equation 3.1. The major difference is that here they involve averages over f according to P(f), since the target f is not fixed. In particular, desiderata b and c are obeyed exactly by bias_{m,q} and variance_{m,q}. Similarly, the first part of desideratum a is obeyed exactly, if the reference to "f" there is taken to mean all f for which P(f) is nonzero, and if the delta functions referred to are implicitly restricted to be identical (at q) for all such f. In addition, σ²_{m,q} is independent of the learning algorithm, in agreement with the second part of desideratum a.

Now that we have the covariance term though, the third part of desideratum a is no longer obeyed. Indeed, by using P(d, y_F | m, q) = P(y_F | d, q) P(d | m, q), we can rewrite σ²_{m,q} as the sum of the expected cost of the best possible (Bayes-optimal) learning algorithm for quadratic loss (that is, the loss of the learning algorithm that obeys P(y_H | d, q) = δ(y_H, E(Y_F | d, q))), plus another term. Here and throughout, any expression of the form "δ(·, ·)" indicates the Kronecker delta function.
That second term is called the data worth of the problem, since it sets how much of an improvement in error can be had from paying attention to the data:

σ²_{m,q} = Σ_{d,y_F} P(d, y_F | m, q) [y_F − E(Y_F | d, q)]²    (the Bayes-optimal algorithm's cost)
    + Σ_d P(d | m, q) ([E(Y_F | d, q)]² − [E(Y_F | m, q)]²)    (the data worth).    (4.2)
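Continuing the invented two-target illustration from section 4.1 (a single test point, two equally likely noise-free targets with Y_F = 0 or 1, and data that always identify the target): the Bayes-optimal algorithm's cost is zero, so equation 4.2 says σ²_{m,q} consists entirely of data worth, Σ_d P(d | m, q) ([E(Y_F | d, q)]² − [E(Y_F | m, q)]²) = (1/2)(0 − 1/4) + (1/2)(1 − 1/4) = 1/4, matching σ²_{m,q} = 1/4.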
One might wonder why all of this does not also apply to the conventional bias-variance decomposition for z = {f, m, q}. The reason is that for f-conditioned probabilities, the best possible algorithm does not guess E(Y_F | d, q) but rather E(Y_F | f, q), so that the equivalent of the data worth term vanishes. This is why the decomposition in equation 4.2 does not also apply to σ²_{f,m,q}.

Note that for the Bayes-optimal learning algorithm, cov_{m,q} exactly equals the data worth. This is to be expected, since for that learning algorithm, bias_{m,q} equals 0 and variance_{m,q} equals cov_{m,q}.²

² The latter point follows from the following identities: For the optimal algorithm, whose guessing is governed by a delta function in Y, the variance is given by

Σ_d P(d | m, q) [E²(Y_H | d, q) − E²(Y_H | q, m)] = Σ_d P(d | m, q) [E²(Y_F | d, q) − E²(Y_F | q, m)].

In addition, the covariance is given by

Σ_{d,y_F,y_H} P(y_H | d, q) P(y_F | d, q) P(d | m, q) × [y_H − E(Y_H | q, m)] × [y_F − E(Y_F | q, m)]
= Σ_d P(d | m, q) × [E(Y_F | d, q) − E(Y_F | q)] × [E(Y_F | d, q) − E(Y_F | q)].

To see why the data worth measures how much paying attention to the data can help you to guess f, note that the expected cost of the best possible data-independent learning algorithm equals σ²_{m,q}. So the data worth is the difference between the expected cost of this algorithm and that of the Bayes-optimal algorithm. Note the nice property that when the variance (as one varies training sets d) of the Bayes-optimal algorithm is large, so is the data worth. So, reasonably enough, when the Bayes-optimal algorithm's variance is large, there is a large potential gain in paying attention to the data. Conversely, if the variance for the Bayes-optimal algorithm is small, then not much can be gained by using it rather than the optimal data-independent learning algorithm.

As it must, E(C | m, q) reduces to the expression in equation 3.1 for E(C | f = f*, m, q) for the prior P(f) = δ(f − f*). The special case of equation 3.1 where there is no noise, and the learning algorithm always
guesses the same single-valued input-output function for the same training set, is given in Wolpert (1995). One can argue that E(C | m, q) is usually of more direct interest than E(C | f, m, q), since one can rarely specify the target in the real world but must instead be content to characterize it with a probability distribution. Insofar as this is true, by equation 4.1 there is not a bias-variance tradeoff, as is conventionally stated. Rather there is a bias-variance-covariance trade-off. More generally, one can argue that one should always analyze the distribution P(C | z) where z is chosen to reflect directly the statistical scenario with which one is confronted. (See the discussion of the honesty principle at the end of Wolpert 1995.) In the real world, this usually entails having z = {d}. In toy experiments with a fixed target, it usually means having z = { f, m}. For other kinds of experiments, it means other kinds of z’s. The only justification for not setting z this way is calculational intractability of the resultant analysis, or difficulty in determining the distributions needed to perform that resultant analysis, or both. (This latter issue is why Bayesian analysis does not automatically meet our needs in the real world. With such analysis there is often the issue of how to determine P( f ), the prior.) However at the level of abstraction of this article, neither difficulty arises. So for current purposes, the applicability of the covariance terms introduced is determined solely by the statistical scenario with which one is confronted. 5 Other Corrections to Quadratic Loss Bias-Plus-Variance 5.1 The General Quadratic Loss Bias-Plus-Variance-Plus-Covariance Decomposition. Often in supervised learning, one is interested in generalization error, the average error between f and h over all q. For fixed f and m, the expectation of this error is E(C | f, m). This quantity is ubiquitous in computational learning theory (COLT), as well as several other popular theoretical approaches to supervised learning (Wolpert, 1995). Nor is interest in it restricted to the theoretical (as opposed to applied) communities; it seems fair to say that it is far more commonly investigated than is E(C | f, m, q) in both the machine learning and neural net literatures. To address generalization error, note that for any set of random variables Z, taking (set) values z, E(C | z) = σz2 + (biasz )2 + variancez −2 covz , where σz2 ≡ E(YF2 | z) − [E(YF | z)]2 , biasz ≡ E(YF | z) − E(YH | z),
(5.1)
On Bias Plus Variance
1225
2 | z) − [E(Y | z)]2 , variancez ≡ E(YH H
and
covz ≡ 6yF ,yH P(yH , yF | z) × [yH − E(YH | z)] × [yF − E(YF | z)].
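As a concrete check of equation 5.1, here is a minimal Monte Carlo sketch; the toy joint distribution over (y_F, y_H) is an illustrative assumption, not part of the analysis:

```python
import numpy as np

# Estimate the four terms of equation 5.1 from paired samples (y_F, y_H)
# drawn from an assumed P(y_H, y_F | z), and verify the decomposition.
rng = np.random.default_rng(0)
n = 100_000
y_f = rng.normal(1.0, 1.0, n)              # samples of Y_F given z
y_h = 0.5 * y_f + rng.normal(0.0, 0.5, n)  # a learner whose guess tracks Y_F

noise = np.var(y_f)                                      # sigma_z^2
bias2 = (np.mean(y_f) - np.mean(y_h)) ** 2               # (bias_z)^2
var = np.var(y_h)                                        # variance_z
cov = np.mean((y_h - y_h.mean()) * (y_f - y_f.mean()))   # cov_z

lhs = np.mean((y_f - y_h) ** 2)            # E(C | z) for quadratic loss
assert np.isclose(lhs, noise + bias2 + var - 2 * cov)    # equation 5.1
```

Because the learner's guess tracks y_F here, the covariance term is positive and lowers the expected cost below what bias and variance alone would suggest.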
The terms in this formula can usually be interpreted just as the terms in the conventional (z = {f, m, q}) decomposition can. So, for example, variance_z measures the variability of the learning algorithm's guess as one varies over those random variables not specified in z. Equation 4.1 is a special case of this formula where z = {m, q}, and equation 3.1 is a special case where z = {f, m, q} (and consequently the covariance term vanishes). However, both of these equations have z contain q, when, by equation 5.1, we could just as easily have z not contain q. By doing that, we get a correction to the bias-plus-variance formula for generalization error. (Note that this correction is in no sense a "Bayesian" correction.) To be more precise, the following is an immediate corollary of equation 5.1:

E(C | f, m) = σ²_{f,m} + (bias_{f,m})² + variance_{f,m} − 2 cov_{f,m}    (5.2)
(with definitions of the terms given in equation 5.1). Note that corrections similar to that of equation 5.2 hold for E(C | f, d) and E(C | m). In all three of these cases, σ²_z, bias_z, and variance_z play the same role as do σ²_{f,m,q}, bias_{f,m,q}, and variance_{f,m,q} in equation 3.1. The only difference is that different quantities are averaged over. (So, for example, in E(C | f, d), variance partially reflects variability in the learning algorithm's guess as one varies q.) In particular, most of our desiderata for these quantities are met. In addition, though, for all three of these cases, P(y_H, y_F | z) ≠ P(y_H | z)P(y_F | z) in general, and therefore the covariance correction term is nonzero in general.

So for these three cases, as well as for the one presented in the previous subsection, there is not a bias-variance dilemma. Rather, there is a bias-variance-covariance dilemma. It is only for a rather specialized kind of analysis, where both q and f are fixed, that one can ignore the covariance term. For almost all other analyses—in particular, the popular generalization error analyses—the huge body of lore explaining various aspects of supervised learning in terms of a bias-variance dilemma is less than the whole story.

5.2 Averaging over a Random Variable versus Having It Be in z. Before leaving the topic of equations 5.1 and 5.2, it should be pointed out that one could just as easily use equation 3.1 and write E(C | f, m) = Σ_q P(q | f, m)[σ²_{f,m,q} + bias²_{f,m,q} + variance_{f,m,q}] rather than use equation 5.2. Under commonly made assumptions (see Wolpert, 1994), P(q | f, m) just equals P(q), which is often called the sampling distribution. In such cases, this q average of equation 3.1 is often straightforward to evaluate.

It is instructive to compare such a q average to the expansion in equation 5.2. First, such a q average is often less informative than equation 5.2. As an example, consider the simple case where f is single valued (i.e., a single-valued function from X to Y), h is single valued, and the same h is guessed for all training sets sampled from f. Then Σ_q P(q | f, m) σ²_{f,m,q} = Σ_q P(q | f, m) variance_{f,m,q} = 0; the q-averaged intrinsic noise and variance terms provide no useful information, and the expected error is given solely by the bias term. On the other hand, σ²_{f,m} tells us the amount that f(q) varies about its average (over q) value, and variance_{f,m} is given by a similar term. In addition, bias_{f,m} tells us the difference between those q averages of f and h. And, finally, the covariance term tells us how much our h tracks f as one moves across X. So all the terms in equation 5.2 provide helpful information, whereas the q average of equation 3.1 reduces to the tautology "Σ_q P(q | f, m)E(C | f, m, q) = Σ_q P(q | f, m)E(C | f, m, q)."

In addition to this difficulty, in certain respects the individual terms in the q average of equation 3.1 do not meet all of our desiderata. In particular, desideratum b is not obeyed in general by that decomposition, if one tries to identify "bias²" as Σ_q P(q | f, m) bias²_{f,m,q} (note that here, the z referred to in desideratum b is {f, m}). On the other hand, the terms in equation 5.2 do meet all of our desiderata, including the third part of a. (The best possible algorithm if one is given f and m is the algorithm that always guesses h(q, y) = δ(y, E(Y_F | f)), regardless of the data.)

There are also aesthetic advantages to using equation 5.1 and its corollaries (like equation 5.2) rather than averages of equation 3.1. For example, equation 5.1 does not treat z = {f, m, q} as special in any sense; all z's are treated equally. This contrasts with formulas based on equation 3.1, in which one writes E(C | f, m) as a q average of E(C | f, m, q), E(C | m, q) as an f average of E(C | f, m, q), and so forth. Moreover, equation 5.1 holds even when z is not a subset of {f, m, q}. So, for example, E(C | h, q) can be expressed directly in terms of bias plus variance by using equation 5.1. However, there is no simple way to express the same quantity in terms of equation 3.1. Indeed, equation 5.1 even allows us to write the purely Bayesian quantity E(C | d) in bias-plus-variance terms, something we cannot do using equation 3.1:

E(C | d) = σ²_d + (bias_d)² + variance_d − 2 cov_d,    (5.3)
where

σ²_d ≡ E(Y²_F | d) − [E(Y_F | d)]²,
bias_d ≡ E(Y_F | d) − E(Y_H | d),
variance_d ≡ E(Y²_H | d) − [E(Y_H | d)]²,

and

cov_d ≡ Σ_{y_F,y_H} P(y_H, y_F | d) × [y_H − E(Y_H | d)] × [y_F − E(Y_F | d)].

By using this formula one does not even have to go to a midway point
between Bayesian analysis and conventional bias-plus-variance analysis to relate the two. Rather, equation 5.3 directly provides a fully Bayesian bias-plus-variance decomposition. So, for example, as long as one is aware that different variables are being averaged over than is the case for the conventional bias-variance decomposition (namely, q and f rather than d), equation 5.3 allows us to say directly that for quadratic loss, in the statistical scenario of interest in Bayesian analysis, the Bayes-optimal learning algorithm has zero bias. Traditionally, the very notion of "bias" has had no meaning in Bayesian analysis.

None of this means that one should not use the appropriate average of equation 3.1 (in those cases where there is such an average) rather than equation 5.1 and its corollaries. There are scenarios in which that average provides a helpful perspective on the learning problem that equation 5.1 does not. For example, if Y contains very few values (e.g., two), then in many scenarios bias_{f,m}, which equals Σ_q P(q | f, m)[E(Y_H | f, m, q) − E(Y_F | f, q)], is close to zero, regardless of the learning algorithm. In the same scenarios though, the associated q average of the equation 3.1 term, Σ_q P(q | f, m) bias²_{f,m,q} = Σ_q P(q | f, m)[E(Y_H | f, m, q) − E(Y_F | f, q)]², is often far from zero. In such cases the bias term in equation 5.2 is not very informative, whereas the associated q-average term is. In the end, which kind of decomposition one uses, just like the choice of what variables z represents, depends on what one is trying to understand about one's learning problem.

6 Other Characteristics Associated with the Loss

6.1 General Properties of Convex and/or Concave Loss Functions. There are a number of other special properties of quadratic loss besides the equations presented thus far. For example, for quadratic loss, for any f, E(C | f, m, q, algorithm A) ≤ E(C | f, m, q, algorithm B) so long as A's guess is the average of B's (formally, so long as we have P(y_H | d, q, A) = δ(y_H, E(Y_H | d, q, B)) = δ(y_H, ∫dh Σ_y y h(q, y) P(h | d, B))). So without any concerns for priors, one can always construct an algorithm that is assuredly superior to an algorithm with a stochastic nature: simply guess the stochastic algorithm's average. (This is a result of Jensen's inequality; see Wolpert, 1995, 1996a, 1996b; and Perrone, 1993.) This is true whether the stochasticity is due to non-single-valued h or (as with backpropagation with a random initial weight) due to the learning algorithm's being nondeterministic.

Now the EBF is symmetric under h ↔ f. Accordingly, this kind of result can immediately be "turned around." In such a form, it says, loosely, that a prior that is the average of another prior assuredly results in lower expected cost, regardless of the learning algorithm. In this particular sense, for quadratic loss, one can place an algorithm-independent ordering over some priors. (Of course, one can also order them in an algorithm-dependent manner if one wishes, for example, by looking at the expected generalization error of the Bayes-optimal learning algorithm for the prior in question.)
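To make the averaging result concrete, here is a minimal sketch under illustrative assumptions (the target value and the stochastic learner below are invented for the example):

```python
import numpy as np

# Jensen's-inequality point for quadratic loss: replacing a stochastic
# learner's guess by its average never increases expected loss, whatever
# the fixed target value y_f.
rng = np.random.default_rng(1)
y_f = 0.3                                  # arbitrary fixed target value
guesses = rng.normal(2.0, 1.5, 100_000)    # algorithm B: stochastic guesses

loss_b = np.mean((guesses - y_f) ** 2)     # expected loss of B
loss_a = (np.mean(guesses) - y_f) ** 2     # loss of A, which guesses B's mean

# loss_a equals loss_b minus the variance of B's guesses, so A never loses.
assert loss_a <= loss_b
```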
The exact opposite behavior holds for loss functions that are concave rather than convex. For such functions, guessing randomly is assuredly superior to guessing the average, regardless of the target. (There is a caveat to this: one cannot have a loss function that is both concave everywhere across an infinite Y and nowhere negative, so formally, this statement holds only if we know that y_F and y_H are both in a region of concave loss.)

6.2 General Properties of Metric Loss Functions. Finally, there are other special properties that some loss functions possess but that quadratic loss does not. For example, if the loss can be written as a function L(·, ·) that is a metric (e.g., absolute value loss, zero-one loss), then for any f,

|E(C | f, h₁, m, q) − E(C | f, h₂, m, q)| ≤ Σ_{y,y′} L(y, y′) h₁(q, y) h₂(q, y′).    (6.1)

For such loss functions, you can bound how much replacing h₁ by h₂ can improve or hurt generalization by looking only at h₁ and h₂. That bound holds without any concern for the prior over f. It is simply the expected loss between h₁ and h₂. Unfortunately, quadratic loss is not a metric, and therefore one cannot employ this bound for quadratic loss.

7 An Alternative Analysis of the Friedman Effect

Kong and Dietterich (1995) have raised the issue of what the appropriate bias-plus-variance decomposition is for zero-one (misclassification) loss, L(y_H, y_F) = 1 − δ(y_H, y_F). They raised the issue in the context of the classical conditioning event: the target, the training set size, and the test set question. The decomposition they suggested (i.e., their suggested definitions of bias and variance for zero-one loss) has several shortcomings, not least of which is that it allows negative variance. Several subsequent papers (Kohavi & Wolpert, 1996; Wolpert & Kohavi, 1997; Tibshirani, 1996; Breiman, 1996) have offered alternative decompositions, with different strengths and weaknesses, as discussed in Kohavi and Wolpert (1996) and Wolpert and Kohavi (1997).

Recently, Friedman (1996) contributed another zero-one loss decomposition to the discussion. Friedman's decomposition applies only to learning algorithms that perform their classification by first predicting the probabilities η_y of all the possible output classes and then picking the class argmax_i[η_i]. In other words, he considers cases where h is single valued, but the value h(q) ∈ Y is determined by finding the maximum over the components of a Euclidean vector random variable η, dependent on q and d, whose components always sum to 1 and are all nonnegative. Intuitively, the η_y are the probabilities of the
various possible Y_F values for q, as guessed by the learning algorithm in response to the training set.

The restriction of Friedman's analysis to such algorithms is not lacking in consequence. For example, it rules out perhaps the simplest possible learning algorithm, one that is of great interest in the computational learning community: from a set of candidate hypothesis input-output functions, pick the one that best fits the training set. This restriction makes Friedman's analysis less general than the other zero-one loss decompositions that have been suggested.

There are several other peculiar aspects to Friedman's decomposition. Oddly, it is multiplicative rather than additive in its suggested definitions of bias and variance. More important, it assumes that as one varies training sets d while keeping f, m, and q constant, the induced distribution over η is gaussian. It then further assumes that the truncation on integrals over such a gaussian distribution imposed by the limits on the range of each of the η_y (each η_y ∈ [0, 1]) is irrelevant, so that erf functions can be replaced by integrals from −∞ to +∞. (Note that even if the gaussian in question is fairly peaked about a value well within [0, 1], that part of the dependence of the integral on certain quantities occurring in the limits on the integration may be significant, in comparison to the other ways that that integral depends on those quantities.) In addition, as Friedman defines it, bias can be negative. Furthermore, his bias does not reduce to zero for the Bayes classifier (and in fact the approximations invoked in his analysis are singular for that classifier). And perhaps most important, his definition of variance depends directly on the underlying target distribution f, rather than indirectly through the f-induced distribution over training sets P(d | f, m, q). This last means that the variance term does not simply reflect how sensitive the learning algorithm is to (target-governed) variability in the training set; the variance also changes even if one makes a change to the target that has no effect on the induced probability distribution over training sets.

These oddities should not obscure the fact that Friedman's analysis is a major contribution. Perhaps the most important aspect of Friedman's analysis is its drawing attention to the following phenomenon. Consider the case where we are interested in expected zero-one loss conditioned on a fixed target, test set question, and training set size. Presume further that we are doing binary classification (r = 2), so the class label probabilities guessed by the learning algorithm reduce to a single real number η₁ giving the guessed probability of class 1 (i.e., η = (η₁, 1 − η₁)). Examine the case where, for the test set point at hand, the average (over training sets generated from f) value of η₁ is greater than one-half. So the guess corresponding to that average η₁ is class 1. However, let us say that the truly optimal prediction (as determined by the target) is class 2. Now modify the scenario so that the variability (over training sets) of the guess η₁ grows, while its average stays the same (i.e., the width of the distribution over η₁ grows). Assume that this extra variability results in having η₁ < 1/2 for more training sets.
So for more training sets, the η₁ produced by the learning algorithm corresponds to the (correct) guess of class 2. Therefore increasing the variability of η₁ while keeping the average the same has reduced overall generalization error. Note that this effect arises only when the average of η₁ results in a nonoptimal prediction, i.e., only when the average is "wrong." (This point will be returned to shortly.)

Now view this variability as, intuitively, akin to a variance. (This definition differs only a little from the formal definition of variance Friedman advocates.) Similarly, view the average of η₁ as giving a bias. Then we have the peculiar result that increasing variance while keeping bias fixed can reduce overall expected generalization error. I will refer to such behavior as the Friedman effect. (See also Breiman, 1996—in particular, the discussion in the first appendix.)

Variability can be identified with the width of the distribution over η₁ and in that sense can indeed be taken to be a variance. The question is whether it makes sense to view it as a variance in the restricted desiderata-biased sense appropriate to bias-variance decompositions. In particular, note that whether the Friedman effect obtains—whether increasing the variability increases or decreases expected error—depends directly not only on the learning algorithm and the distribution over training sets, but also on the target (the average η₁ must result in a guessed class label that differs from the optimal prediction as determined by the target for the Friedman effect to hold).

This direct dependence of the Friedman effect on the target is an important clue. Recall in particular that our desiderata preclude having a variance with such a direct dependence. However, such a dependence of the "variability" term could be allowed if the variability that is being increased is identified with a variance combined with another quantity. Given the general form of the bias-variance decomposition, the obvious choice for that other quantity is a term reflecting a covariance. By having that covariance involve varying over y_F values as well as training sets, we get the direct dependence on the target arising in the Friedman effect. In addition, viewing the variability as involving a covariance term along with a variance term could potentially explain away the peculiarity of the Friedman effect's having error shrink as variability increases. This would be the case, for example, if holding the covariance part of the variability fixed while increasing the variance part always did increase expected generalization error. The idea would be that the way that variability is increased in the Friedman effect involves changes to the covariance term as well as the variance term, and it is the changes to the covariance term that are ultimately responsible for the reduction in expected generalization error. Increasing the variance term, by itself, can only increase expected error, exactly as in the conventional bias-plus-variance decomposition.

As it turns out, this hypothesized behavior is exactly what lies behind the Friedman effect. This can best be seen by using a different decomposition from Friedman's. In exploring that alternative decomposition, we will see
that there is nothing inherently unusual about how variance is related to generalization error for zero-one loss; in this alternative decomposition, the decrease in generalization error associated with increasing variability is due to the covariance term rather than the variance term. Common intuition is salvaged. This alternative decomposition has the additional advantage that it is valid even if the assumptions that Friedman's analysis requires do not hold. Moreover, this alternative decomposition is additive rather than multiplicative, holds for all algorithms that employ η's, and in general avoids the other oddities of Friedman's analysis.

Unfortunately, to clarify all this, it is easiest to work with somewhat generalized notation. (The reader's forbearance is requested.) The reason for this is that since it can take on only two values, the zero-one loss can hide a lot of phenomena via "accidental" cancellation of terms and the like. To have all such phenomena manifest, we need the generalized notation.

First write the cost in terms of an abstraction of the expected zero-one loss:

C = C(f, η, q) = Σ_y [1 − R(y, η)] f(q, y) = 1 − Σ_y R(y, η) f(q, y).

It is required that R(y, η) be unchanged if one relabels both y and the components of η simultaneously and in the same manner. As an example of an R, for the precise case Friedman considers, there are two possible values of y, and R(y, η) = 1 for y = argmax_i(η_i), 0 for all other y.

Note that when expressed this way, for fixed q the cost is given by a dot product between two vectors indexed by Y values—namely, R and f. Moreover, for many R (e.g., the zero-one R), that dot product is between two probability distributions over y. Note also that whereas h and f are on an equal footing in the EBF (cf. section 2, especially its end), the same is not true for η and f. Indeed, whereas h and f arise in a symmetric manner in the zero-one loss cost (C(f, h, q) = Σ_{y_H,y_F} [1 − δ(y_H, y_F)] h(q, y_H) f(q, y_F)), the same is manifestly not true for the variables η and f.

For fixed η and q, respectively, for both the (Y-indexed) vector R(y, η) and the vector f(q, y), there are only r − 1 free components, since in both cases the components must sum to 1. Accordingly, as in Friedman's analysis, here it makes sense to reduce the number of free variables to 2r − 2 by expressing one of the R components in terms of the others and similarly for one of the f components. A priori, it is not clear which such components should be reexpressed this way. Here I will choose to reexpress the component y*(z) ≡ argmax_y R(y, E[η | z]) this way for both R and f. For Friedman's R(·, ·), this y*(z) is equivalent to the y maximizing E[η_y | z]. (So in the example above of the Friedman
effect, y*(z) = class 1.) Accordingly, from now on I will replace R(y*(z), η) with 1 − Σ_{y≠y*(z)} R(y, η), and I will replace f(q, y*(z)) with 1 − Σ_{y≠y*(z)} f(q, y) wherever those terms appear. (From now on, when the context makes z clear, I will write y* rather than y*(z).) Having made these replacements, we can write

C = Σ_{y≠y*} R(y, η) + Σ_{y≠y*} f(q, y) − [Σ_{y≠y*} R(y, η)] × [Σ_{y≠y*} f(q, y)] − Σ_{y≠y*} R(y, η) f(q, y),

which we can rewrite as

C = Σ_{y≠y*} R(y, η) + Σ_{y≠y*} f(q, y) − S_{y≠y*}(R(y, η), f(q, y)),

if we use the shorthand S_{i}(g(i), h(i)) ≡ [Σ_i g(i)] × [Σ_i h(i)] + Σ_i g(i)h(i).

Now we must determine what noise plus bias is. Since η and f do not live in the same space, not all of the desiderata listed previously are well defined. Nonetheless, we can satisfy their spirit by taking noise plus bias to be E[C(·, E[η | z], q) | z], that is, by taking it to be the expected value of C when one guesses using the expected output of the learning algorithm. In particular, this definition makes sense in light of desideratum 3. Writing it out, by using the multilinearity of S(·, ·), we see that for this definition noise plus bias is given by

Σ_{y≠y*} R(y, E[η | z]) + Σ_{y≠y*} E[F(Q, y) | z] − S_{y≠y*}(R(y, E[η | z]), E[F(Q, y) | z]).

(As a point of notation, E[F(Q, y) | z] means ∫df Σ_q f(q, y) P(f, q | z); it is the y component of the z-conditioned average target for average q.)

As an example, for the zero-one loss R(·, ·) Friedman considers, R(y, E[η | z]) = 0 for all y ≠ y*. Therefore noise plus bias reduces to Σ_{y≠y*} E[F(Q, y) | z]. For his z, {f, m, q}, this is just 1 − f(q, y*). So for Friedman's r = 2 scenario, if the class corresponding to the learning algorithm's average η₁ is the same as the target's average class y*, noise plus bias is min_y f(q, y). Otherwise it is max_y f(q, y). If we identify min_y f(q, y) with the noise term, this means that the bias equals either zero or |f(q, y = 1) − f(q, y = 2)|, depending on whether the class corresponding to the learning algorithm's average η₁ is the same as the target's average class.

Continuing with our more general analysis, the difference between E(C | z) and noise plus bias is

Σ_{y≠y*} {E[R(y, η) | z] − R(y, E[η | z])}
  − {E[S_{y≠y*}(R(y, η), F(Q, y)) | z] − S_{y≠y*}(R(y, E[η | z]), E[F(Q, y) | z])}.    (7.1)
For many R, the expression on the first line of equation 7.1 is nonnegative. For example, due to the definition of y*, this is the case for the zero-one R. In addition, that expression does not involve f directly, equals zero when the learning algorithm's guess never changes, and so forth. Accordingly, that expression (often) meets the usual desiderata for a variance and will here be identified as a variance. If everything else is kept constant while this variance is increased, then expected error also increases; there is none of the peculiar behavior Friedman found if one identifies variance with this expression from equation 7.1. For the zero-one R, this variance term is the probability that the y maximizing η_y is not the one that maximizes the expected η_y.

The remaining terms in equation 7.1 collectively act like the negative of a covariance. Indeed, for the case Friedman considers where r = 2, if we define ∼y* as the element of Y that differs from y*, we can write (the negative of) those terms as

2{E[R(∼y*, η) × F(Q, ∼y*) | z] − R(∼y*, E[η | z]) × E[F(Q, ∼y*) | z]}.

Note the formal parallel between this expression for the remaining terms in equation 7.1 and the functional forms of the covariance terms in the bias-variance decompositions for other costs that were presented above. Of course, the parallel is not exact. In particular, this expression is not exactly a covariance; a covariance would have an E[R | z] × E[f | z] type term rather than an R(·, E[η | z]) × E[F | z] term. The presence of the R(·, ·) functional is also somewhat peculiar. Indeed, the simple fact that the covariance term is nonzero for z = {f, m, q} is unusual (see the quadratic loss decomposition, for example). Ultimately, all these effects follow from the fact that the decomposition considered here does not treat targets and hypotheses the same way; targets are represented by f's, whereas hypotheses are represented by η's rather than h's. (Recall that for Friedman's scenario's R, h is determined by the maximal component of η.) If instead hypotheses were represented by h's, then the zero-one loss decomposition would involve a proper covariance, without any R(·, ·) functional (Kohavi & Wolpert, 1996).

It is the covariance term, not the variance term, that results in the possibility that an increase in variability reduces expected generalization error. To see this, along with Friedman, take z to equal {f, m, q}. Then our covariance term becomes

2{E[R(∼y*, η) | f, m, q] × f(q, ∼y*) − R(∼y*, E[η | f, m, q]) × f(q, ∼y*)}.

(Note that for the zero-one R(·, ·), R(∼y*, E[η | f, m, q]) = 0, since by definition ∼y* is not the Y value that maximizes E[η | f, m, q].)
Again following along with Friedman, have E[η | f, m, q] fixed while increasing variability. Next presume that that increase in variability increases E[R(∼y*, η) | z], as Friedman does. (E[R(∼y*, η) | z] gives the amount of "spill" of the learning algorithm's guess η₁ into ∼y*, the output label other than the one corresponding to the algorithm's average η₁.) Doing all this will always decrease the contribution to the expected error arising from the covariance term. It will also always increase the variance term's contribution to the expected error. So long as f(q, ∼y*) > 1/2, the former phenomenon will dominate the latter, and overall, expected error will decrease. This condition on f(q, ∼y*) is exactly the one that corresponds to Friedman's "peculiar behavior"; it means that the target is weighted toward the opposite Y value from the one given by the expected guess of the learning algorithm. So whether the peculiar behavior holds when one increases variability depends on whether the associated increase in variance manages to offset the associated increase in covariance (this latter increase resulting in a decrease in contribution to expected error). As in so much else having to do with the bias-variance decomposition, we have a classical trade-off between two competing phenomena, both arising in response to the same underlying modification to the learning problem. There is nothing peculiar in this, but in fact only classical bias-variance phenomenology.

Nonetheless, it should be noted that for zero-one R, the covariance just equals 2 f(q, ∼y*) times the variance. Moreover, for zero-one R and z = {f, m, q}, noise plus bias = 1 − f(q, y*). So if noise and bias are held constant, it is impossible to change the variance while everything else is held constant; the covariance will necessarily change as well, assuming the bias and noise do not. This behavior should not be too surprising. Since the zero-one loss can take on only two values, one might expect that it is not possible for all four of noise, bias, variance, and covariance to be independent. The reason for the precise dependency among those terms encountered here can be traced to two factors: choosing z to equal {f, m, q}, and performing the analysis in terms of η rather than h. For decompositions involving h (rather than η), the covariance term relates variability in h to variability in f, and therefore must vanish for the fixed f induced by having z equal {f, m, q}. However, in this section we are doing the analysis in terms of η, and η does not have the formal equal footing with f that h does. So rather than a proper covariance between η and f, we instead have here an unconventional "covariance-like" term. And this term does not disappear even for fixed f. As a sort of substitute, the covariance-like term instead is uniquely determined by the values of the noise, bias, and variance, if f is fixed.
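The interplay just described can be watched in a small simulation. This is an illustrative sketch, not Friedman's analysis: the target probabilities and the gaussian spread of η₁ are assumptions made for the example.

```python
import numpy as np

# Friedman effect for zero-one loss: binary classification at a fixed test
# point q, target f(q, class1) = 0.4 (so the optimal prediction is class 2),
# and a learner whose guessed probability eta1 of class 1 varies over
# training sets as a gaussian with fixed mean 0.6 (the "wrong" side of 1/2).
rng = np.random.default_rng(2)
f_class1 = 0.4

for spread in (0.01, 0.1, 0.3, 1.0):
    eta1 = rng.normal(0.6, spread, 1_000_000)  # guessed P(class 1), one per d
    guess_class1 = eta1 > 0.5
    # expected zero-one loss, averaged over training sets and over Y_F
    err = np.mean(np.where(guess_class1, 1 - f_class1, f_class1))
    print(f"spread {spread:4.2f}: expected error {err:.3f}")

# The printed error falls toward 0.5 as the spread grows: more variability
# in eta1 means the (correct) class 2 is guessed more often.
```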
8 Bias Plus Variance for Logarithmic Scoring

8.1 The {f, m, q}-Conditioned Decomposition. To begin the analysis of bias plus variance for logarithmic scoring, consider the case where z = {f, m, q}. The logarithmic scoring rule is related to Kullback-Leibler distance and is given by E(C | f, h, q) = −Σ_y f(q, y) ln[h(q, y)], so

E(C | f, m, q) = −Σ_y f(q, y) Σ_d P(d | f, m) ∫dh P(h | d) ln[h(q, y)].
Unlike quadratic loss, logarithmic scoring is not symmetric under f ↔ h. This scoring rule (sometimes instead referred to as the "log loss function," and proportional to the value of the "logarithmic score function" for an infinite test set) can be appropriate when the output of the learning algorithm h is meant to be a guess for the entire target distribution f (Bernardo & Smith, 1994). This is especially true when Y is a categorical rather than a numeric space. To understand that, consider the case where you guess h and have some test set T generated from f that you wish to use to score h. How to do that? One obvious way is to score h as the log likelihood of T given h. If we now have T be infinite, take the geometric mean of the likelihood to get it to be nonzero, and average over all T generated from f, we get logarithmic scoring. Note that logarithmic scoring can be used even when there is no metric structure on Y (as there must be for quadratic loss to be used).

In creating an analogy of the bias-plus-variance formula for cases where C is not given by quadratic loss, one would like to meet conditions 1 through 4 and a through c presented above. However, often there is no such thing as E(Y_F | f, q) when logarithmic scoring is used (i.e., often Y is not a metric space when logarithmic scoring is used). So we cannot measure intrinsic noise relative to E(Y_F | f, q), as in the quadratic loss bias-plus-variance formula. This means that meeting conditions 1 through 4 will be impossible; the best we can do is come up with a formula whose terms meet desiderata a through c and that have behavior analogous in some sense to conditions 1 through 4.

One natural way to measure intrinsic noise for logarithmic scoring is as the Shannon entropy of f,

σ_{ls;f,m,q} ≡ −Σ_y f(q, y) ln[f(q, y)].    (8.1)
Note that this definition meets all three parts of desideratum a. It is also the expected error of f at guessing itself, in close analogy to the intrinsic noise term for quadratic loss (see condition 1). To form an expression for logarithmic scoring that is analogous to the bias² term for quadratic loss, we cannot directly start with the terms involved in quadratic loss because they need not be properly defined (e.g., E(Y_F | z) is not defined for categorical spaces Y, even though logarithmic scoring is perfectly well defined for that case). To circumvent this difficulty,
first note that for quadratic loss, expected Y values are the modes of optimal hypotheses (recall that hypotheses are distributions, and note that for quadratic loss, minimizing expected loss means taking a mean Y value in general). In addition, squares of differences between Y values are expected costs. Keeping this in mind, the bias² term for quadratic loss can be rewritten as the following sum:

Σ_{y,y′} L(y, y′) δ(y, E(Y_F | f, q, m)) δ(y′, E(Y_H | f, q, m)).

This is the expected quadratic loss between two X-conditioned distributions over Y given by the two delta functions. The first of those distributions is what the optimal hypothesis would be if the target were given by E(F | z) (for z = {f, m, q}, E(F | z) = f). More formally, the first of the two distributions is δ(y, E(Y_F | z)) = argmin_h E(C | f = E(F | z), m, q, h). I will indicate this by defining Opt(u) ≡ argmin_h E(C | f = u, m, q, h), where u is any distribution over Y. So this first of our two distributions is Opt(E(F | z)).

For z = {f, m, q}, P(y_F | f = E(H | z), q, m) = f(q, y_F) evaluated for f = ∫dh P(h | f, m, q) h; that is, it equals the vector ∫dh h P(h | f, m, q) evaluated for indices q and y_F. By the usual properties of vector spaces, this can be rewritten as ∫dh h(q, y_F) P(h | f, m, q). However, this is just P(y_H | f, q, m) evaluated for y_H = y_F. Accordingly, E(Y_H | f, q, m) = E(Y_F | f = E(H | z), q, m). So the second of the two distributions in our expression for bias² is what the optimal hypothesis would be if the target were given by E(H | z): argmin_h E(C | f = E(H | z), m, q, h). We can indicate this second optimal hypothesis by Opt(E(H | z)). Combining, we can now rewrite the bias² for quadratic loss as

E(C | f = Opt(E(F | z)), h = Opt(E(H | z)), q, m).
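The next paragraph uses the fact that for logarithmic scoring Opt(u) = u (logarithmic scoring is a proper scoring rule). Here is a quick numerical illustration of that fact, with an assumed three-class distribution:

```python
import numpy as np

# For logarithmic scoring the expected score -sum_y u(y) ln h(y) is
# minimized at h = u (Gibbs' inequality), i.e., Opt(u) = u.
rng = np.random.default_rng(7)
u = np.array([0.6, 0.3, 0.1])

score_at_u = -np.sum(u * np.log(u))
for h in rng.dirichlet(np.ones(3), size=1000):
    assert -np.sum(u * np.log(h)) >= score_at_u
```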
Unlike the original expression for bias² for quadratic loss, this new expression can be evaluated even for logarithmic scoring. The Opt(·) function is different for logarithmic scoring and quadratic loss. For logarithmic scoring, by Jensen's inequality, Opt(u) = u. (Scoring rules obeying this property are sometimes said to be proper scoring rules.) Accordingly, our bias term can be written as E(C | f = E(F | z), h = E(H | z), q). For z = {f, m, q}, this reduces to E(C | f, h = E(H | f, m, q), q). As an aside, note that for quadratic loss, this same expression would instead be identified with noise plus bias. This different way of interpreting the same expression reflects the difference between measuring cost by comparing y_H and y_F versus doing it by comparing h and f. Writing it out, the bias term for logarithmic scoring is

−Σ_y f(q, y) ln{Σ_d P(d | f, m) ∫dh P(h | d) h(q, y)}.³

³ Note that E(C | f, h = E(H | f, m, q), q) for quadratic loss is Σ_{y_H,y_F} L(y_H, y_F) f(q, y_F) Σ_d ∫dh P(h | d) P(d | f, m) h(q, y_H). However, this just equals E(C | f, m, q), rather than (as for logarithmic scoring) E(C | f, m, q) minus a variance term. This difference between the two cases reflects the fact that whereas expected error for loss functions is linear in h in general, expected error for scoring rules is not.
One difficulty with this expression is that its minimal value over all learning algorithms is greater than zero. To take care of that, I will subtract from this expression the additive constant of its minimal value. That minimal value is given by the learning algorithm that always guesses h = f, independent of the training set. Note that this minimal value is just our noise term. So another way to arrive at our choice for the bias term is to simply identify E(C | f, h = E(H | f, m, q), q) with noise plus bias, its value for quadratic loss. Accordingly, our final measure of the "bias" for logarithmic scoring is the Kullback-Leibler distance between the distribution f(q, ·) and the average h(q, ·). With some abuse of notation, this can be written as follows:

bias_{ls;f,m,q} ≡ −Σ_y f(q, y) ln[E(H(q, y) | f, m) / f(q, y)]
  = −Σ_y f(q, y) ln[(Σ_d P(d | f, m) ∫dh P(h | d) h(q, y)) / f(q, y)].    (8.2)
This definition of bias_{ls;f,m,q} meets desideratum b. Given these definitions, the variance for logarithmic scoring and conditioning on f, m, and q (so there should not be a covariance term), variance_{ls;f,m,q}, is fixed and given by

variance_{ls;f,m,q} ≡ −Σ_y f(q, y){E(ln[H(q, y)] | f, m) − ln[E(H(q, y) | f, m)]}
  = −Σ_y f(q, y){Σ_d P(d | f, m) ∫dh P(h | d) ln[h(q, y)] − ln[Σ_d P(d | f, m) ∫dh P(h | d) h(q, y)]}.    (8.3)

Combining, for logarithmic scoring,

E(C | f, m, q) = σ_{ls;f,m,q} + bias_{ls;f,m,q} + variance_{ls;f,m,q}.    (8.4)
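Equation 8.4 can be checked numerically. The following sketch uses an assumed three-class target and a handful of equally likely training sets, each producing one guessed distribution; all names are illustrative:

```python
import numpy as np

# Verify E(C | f, m, q) = noise + bias + variance (equations 8.1-8.4)
# at a single q for logarithmic scoring.
rng = np.random.default_rng(3)
f = np.array([0.5, 0.3, 0.2])              # f(q, y)
h_d = rng.dirichlet(np.ones(3), size=5)    # one guess h_d(q, y) per d
p_d = np.full(5, 0.2)                      # P(d | f, m), uniform here

h_bar = p_d @ h_d                          # E(H(q, y) | f, m)
cost = -np.sum(f * (p_d @ np.log(h_d)))                       # E(C | f, m, q)
noise = -np.sum(f * np.log(f))                                # eq. 8.1
bias = -np.sum(f * np.log(h_bar / f))                         # eq. 8.2
variance = -np.sum(f * (p_d @ np.log(h_d) - np.log(h_bar)))   # eq. 8.3

assert np.isclose(cost, noise + bias + variance)              # eq. 8.4
```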
It is straightforward to establish that variance_{ls;f,m,q} meets the requirements in desideratum c. First, consider the case where P(h | d) = δ(h − h₀) for some h₀ (this delta function is the multidimensional Dirac delta function). In this case the term inside the curly brackets in equation 8.3 just equals
ln[h₀(q, y)] − ln[h₀(q, y)] = 0. So variance_ls does equal zero when the guess h is independent of the training set d. (In fact, it equals zero even if h is not single valued, such single-valued h being the precise case desideratum c refers to.) Next, since the log is a concave function, we know that the term inside the curly brackets is never greater than zero. Since f(q, y) ≥ 0 for all q and y, this means that variance_{ls;f,m,q} ≥ 0 always.

Finally, we can examine the P(h | d) that make variance_ls large. Any h is an |X|-fold Cartesian product of vectors living on |Y|-dimensional unit simplices. Accordingly, for any d, P(h | d) is a probability density function in a Euclidean space. To simplify matters further, assume that P(h | d) is deterministic, so it specifies a single unique distribution h for each d, indicated by h_d. Then the term inside the curly brackets in equation 8.3 equals Σ_d P(d | f, m) ln[h_d(q, y)] − ln[Σ_d P(d | f, m) h_d(q, y)]. This is the difference between an average of a function and the function evaluated at the average. Since the function in question is concave, this difference grows if the points going into the average are far apart. That is, to have large variance_{ls;f,m,q}, the h_d(q, y) should differ markedly as d varies. This establishes the final part of desideratum c.

8.2 Alternative {f, m, q}-Conditioned Decompositions. The approach taken here to deriving a bias-plus-variance formula for logarithmic scoring is not perfect. For example, the formula for variance_{ls;f,m,q} is not identical to the formula for σ_{ls;f,m,q} under the interchange of F with H (as is the case for the variance and intrinsic noise terms for quadratic loss). In addition, variance_{ls;f,m,q} can be made infinite by having h_d(q, y) = 0 for one d and one y, assuming both f(q, y) and P(d | f) are nowhere zero. Although not surprising given that we are interested in logarithmic scoring, this is not necessarily desirable behavior in a variance-like quantity.

Other approaches tend to have even more major problems, however. For example, as an alternative to the approach taken here, one could imagine trying to define a variance first, and then define bias by requiring that the bias plus the variance plus the noise gives the expected error. It is not clear how to follow this approach, however. In particular, one natural definition of variance would be the average cost between h and the average h,

−∫dh P(h | f, m, q) Σ_y E(H(q, y) | f, m, q) ln[h(q, y)].

(Cf. the formula at the beginning of this section giving E(C | f, m, q) for logarithmic scoring.) This can be rewritten as

−Σ_y {Σ_{d′} P(d′ | f, m) ∫dh P(h | d′) h(q, y)} Σ_d P(d | f, m) ∫dh′ P(h′ | d) ln[h′(q, y)].
However, consider the case where P(h | d) = δ(h − f) for all d for which P(d | f, m) ≠ 0. With this alternative definition of variance, in such a situation we would have the variance equaling −Σ_y f(q, y) ln[f(q, y)] = σ_{ls;f,m,q}, not zero. (Indeed, just having P(h | d) = δ(h − h₀) for some fixed h₀ for all d for which P(d | f, m) ≠ 0 suffices to violate our desiderata, since this in general will not result in zero variance.) Moreover, in this scenario, the variance would also equal E(C | f, m, q). So for this scenario, bias = E(C | f, m, q) − variance − σ_{ls;f,m,q} would equal −σ_{ls;f,m,q}, not zero. This violates our desideratum concerning bias.

Yet another possible formulation of the variance would be (in analogy to the formula presented above for logarithmic scoring intrinsic noise) the Shannon entropy of the average H, E(H | z). But again, this formulation of variance would violate our desiderata. In particular, for this definition of variance, having h be independent of d need not result in zero variance.

8.3 Corrections to the Decomposition When z ≠ {f, m, q}. Finally, just as there is an additive Bayesian correction to the {f, m, q}-conditioned quadratic loss bias-plus-variance formula, there is also one for the logarithmic scoring formula. As useful shorthand, write

E(F(·, y) | z) ≡ Σ_q ∫df P(f, q | z) × f(q, y)
  = Σ_q ∫df P(q | z) P(f | q, z) f(q, y),

E(H(·, y) | z) ≡ Σ_q ∫dh P(h, q | z) × h(q, y)
  = Σ_q ∫dh Σ_d ∫df P(q | z) P(f | q, z) P(d | f, q, z) P(h | d, q, z) h(q, y),

E(ln[H(·, y)] | z) ≡ Σ_q ∫dh P(h, q | z) × ln[h(q, y)]
  = Σ_q ∫dh Σ_d ∫df P(q | z) P(f | q, z) P(d | f, q, z) P(h | d, q, z) ln[h(q, y)],

and

E(F(·, y) ln[H(·, y)] | z) ≡ Σ_q ∫df dh P(h, f, q | z) × f(q, y) ln[h(q, y)]
  = Σ_q ∫df dh Σ_d P(q | z) P(f | q, z) P(d | f, q, z) P(h | d, q, z) f(q, y) ln[h(q, y)].
Then simple algebra verifies the following:

E(C | z) = σ_{ls;z} + bias_{ls;z} + variance_{ls;z} + cov_{ls;z},    (8.5)

where

σ_{ls;z} ≡ −Σ_y E(F(·, y) | z) ln[E(F(·, y) | z)],
bias_{ls;z} ≡ −Σ_y E(F(·, y) | z) ln[E(H(·, y) | z) / E(F(·, y) | z)],
variance_{ls;z} ≡ −Σ_y E(F(·, y) | z){E(ln[H(·, y)] | z) − ln[E(H(·, y) | z)]},

and

cov_{ls;z} ≡ −Σ_y {E(F(·, y) ln[H(·, y)] | z) − E(F(·, y) | z) × E(ln[H(·, y)] | z)}.

Note that in equation 8.5 we add the covariance term rather than subtract it (as in equation 3.1). Intuitively, this reflects the fact that −ln(·) is a monotonically decreasing function of its argument, in contrast to (·)². However, even with the sign backward, the covariance term as it occurs in equation 8.5 still means that if the learning algorithm tracks the posterior—if when f(q, y) rises, so does h(q, y)—then the expected cost is smaller than it would be otherwise.

9 Bias Plus Variance for Quadratic Scoring

9.1 The {f, m, q}-Conditioned Decomposition. In quadratic scoring, C(f, h, q) = Σ_y [f(q, y) − h(q, y)]². This is not to be confused with the quadratic score function, which has C(y_F, h, q) = 1 − Σ_y [h(q, y) − δ(y, y_F)]². This score function can be used only when one has a finite test set, which is not the case for quadratic scoring. (See Bernardo & Smith, 1994.) Analysis of bias-plus-variance decompositions for the quadratic score function is the subject of future work.

For quadratic scoring, for z = {f, m, q}, the lower bound on an algorithm's error is zero: guess h to equal f always, regardless of d. Accordingly (see desideratum a), the intrinsic noise term must equal zero. To determine the bias term for quadratic scoring, employ the same trick used for logarithmic scoring to write that term as E(C | f = Opt(E(F | z)), h = Opt(E(H | z)), q). Again as with logarithmic scoring, for any X-conditioned distribution over Y, u, Opt(u) = u. Accordingly, for quadratic scoring, our bias term is E(C | f = E(F | z), h = E(H | z), q). For z = {f, m, q}, this reduces to E(C | f, h = E(H | f, m, q), q) = Σ_y [f(q, y) − E(H(q, y) | f, m, q)]²:

bias_{qs;f,m,q} = Σ_y [P(Y_F = y | f, m, q) − P(Y_H = y | f, m, q)]².

As with logarithmic scoring, we then set variance to be the difference between E(C | f, m, q) and the sum of the intrinsic noise and bias terms. So
for quadratic scoring,

E(C | f, m, q) = σ_{qs;f,m,q} + bias_{qs;f,m,q} + variance_{qs;f,m,q},

where

σ_{qs;f,m,q} ≡ 0,
bias_{qs;f,m,q} ≡ Σ_y [P(Y_H = y | f, m, q) − P(Y_F = y | f, m, q)]²,

and

variance_{qs;f,m,q} ≡ Σ_y ∫dh P(h | f, m, q)[h(q, y)]² − Σ_y [∫dh P(h | f, m, q) h(q, y)]².
As usual, these are in agreement with desiderata a through c. Interestingly, bias_{qs;f,m,q} is the same as the bias² for zero-one loss for z = {f, m, q} (see Kohavi & Wolpert, 1996, and Wolpert & Kohavi, 1997).

9.2 The Arbitrary z Decomposition. To present the general-z case, recall the definitions of E(F(·, y) | z) and E(H(·, y) | z) made just before equation 8.5. In addition, define

E(F²(·, y) | z) ≡ Σ_q ∫df P(f, q | z) × f²(q, y),
E(H²(·, y) | z) ≡ Σ_q ∫dh P(h, q | z) × h²(q, y),

and

E(F(·, y)H(·, y) | z) ≡ Σ_q ∫df dh P(h, f, q | z) × f(q, y) h(q, y).

Then we can write

E(C | z) = σ_{qs;z} + bias_{qs;z} + variance_{qs;z} − 2 cov_{qs;z},    (9.1)

where

σ_{qs;z} ≡ Σ_y {E(F²(·, y) | z) − [E(F(·, y) | z)]²},
bias_{qs;z} ≡ Σ_y [P(Y_H = y | z) − P(Y_F = y | z)]²,
variance_{qs;z} ≡ Σ_y {E(H²(·, y) | z) − [E(H(·, y) | z)]²},

and

cov_{qs;z} ≡ Σ_y {E(F(·, y)H(·, y) | z) − E(F(·, y) | z) × E(H(·, y) | z)}.

As usual, these decompositions meet essentially all of desiderata a through c.
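Equation 9.1 can likewise be verified by simulation. The joint distribution over target and hypothesis below is an illustrative assumption:

```python
import numpy as np

# Check of equation 9.1 for quadratic scoring at a fixed q, for a generic
# conditioning event z under which both f and h vary.
rng = np.random.default_rng(4)
n, r = 10_000, 3
f = rng.dirichlet(np.ones(r), size=n)                  # samples of F(q, .)
h = 0.5 * f + 0.5 * rng.dirichlet(np.ones(r), size=n)  # hypotheses tracking f

cost = np.mean(np.sum((f - h) ** 2, axis=1))           # E(C | z)
noise = np.sum(np.var(f, axis=0))                      # sigma_qs;z
bias = np.sum((f.mean(0) - h.mean(0)) ** 2)            # bias_qs;z
var = np.sum(np.var(h, axis=0))                        # variance_qs;z
cov = np.sum(np.mean(f * h, axis=0) - f.mean(0) * h.mean(0))  # cov_qs;z

assert np.isclose(cost, noise + bias + var - 2 * cov)  # equation 9.1
```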
10 Future Work

Future work consists of the following projects:

1. Investigate the real-world manifestations of the Bayesian correction to bias plus variance for quadratic loss. (For example, it seems plausible that whereas the bias-variance trade-off involves things like the number of parameters involved in the learning algorithm, the covariance term may involve things like model misspecification in the learning algorithm.)

2. Investigate the real-world manifestations of the bias-variance trade-off for the logarithmic and quadratic scoring definitions of bias and variance used here.

3. See if there are alternative definitions of bias and variance for logarithmic and quadratic scoring that meet our desiderata. More generally, investigate how unique bias-plus-variance decompositions are.

4. Investigate what aspects of the relationship between C and the other random variables (like Y_H and Y_F) are necessary for there to be a bias-plus-variance decomposition for E(C | f, m, q) that meets conditions 1 through 4 and a through c. Investigate how those aspects change as one modifies z and/or the conditions one wishes the decomposition to meet.

5. The EBF provides several natural z-dependent ways to determine how close one learning algorithm is to another. For example, one could define the distance between learning algorithms A and B, Δ(A, B), as the mutual information between the distributions P(c | z, A) and P(c | z, B). One could then explore the relationship between Δ(A, B) and how similar the terms in the associated bias-plus-variance decompositions are.

6. Investigate the "data worth" for other z's besides {m, q} and/or other C's besides the quadratic loss function. In particular, for what C's will all our desiderata be met and yet the intrinsic noise be given exactly by the error associated with the Bayes-optimal learning algorithm?

7. Instead of going from a C(F, H, Q) to definitions for the associated bias, variance, and so on, do things in reverse; that is, investigate the conditions under which one can back out to find an associated C(F, H, Q), given arbitrary definitions of bias, intrinsic noise, variance, and covariance (arbitrary within some class of "reasonable" such definitions).

Acknowledgments

I thank Ronny Kohavi, Tom Dietterich, and Rob Schapire for getting me interested in the problem of bias plus variance for nonquadratic loss functions,
and Ronny Kohavi and David Rosen for helpful comments on the manuscript. This work was supported in part by the Santa Fe Institute and in part by TXN Inc.

References

Bernardo, J., & Smith, A. (1994). Bayesian theory. New York: Wiley and Sons.
Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 421). Berkeley: Department of Statistics, University of California.
Breiman, L. (1996). Bias, variance, and arcing classifiers. Unpublished manuscript.
Buntine, W., & Weigend, A. (1991). Bayesian back-propagation. Complex Systems 5: 603–643.
Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse of dimensionality. Unpublished manuscript.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation 4: 1–58.
Kohavi, R., & Wolpert, D. (1996). Bias plus variance for zero-one loss functions. In Proceedings of the 13th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning (pp. 314–321). San Mateo, CA: Morgan Kaufmann.
Perrone, M. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Unpublished doctoral dissertation, Brown University, Providence, RI.
Tibshirani, R. (1996). Bias, variance, and prediction error for classification rules. Unpublished manuscript.
Wolpert, D. (1994). Filter likelihoods and exhaustive learning. In S. J. Hanson et al. (Eds.), Computational Learning Theory and Natural Learning Systems II (pp. 29–50). Cambridge, MA: MIT Press.
Wolpert, D. (1995). The relationship between PAC, the statistical physics framework, the Bayesian framework, and the VC framework. In D. Wolpert (Ed.), The Mathematics of Generalization. Reading, MA: Addison-Wesley.
Wolpert, D. (1996a). The lack of a priori distinctions between learning algorithms. Neural Computation 8: 1341.
Wolpert, D. (1996b). The existence of a priori distinctions between learning algorithms. Neural Computation 8: 1391.
Wolpert, D., & Kohavi, R. (1997). The mathematics of the bias-plus-variance decomposition for zero-one loss functions. In preparation.
Received September 1, 1995; accepted September 5, 1996.
NOTE
Communicated by David Wolpert
Note on Free Lunches and Cross-Validation Cyril Goutte Department of Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark
The "no-free-lunch" theorems (Wolpert & Macready, 1995) have sparked heated debate in the computational learning community. A recent communication (Zhu & Rohwer, 1996) attempts to demonstrate the inefficiency of cross-validation on a simple problem. We elaborate on this result by considering a broader class of cross-validation. When used more strictly, cross-validation can yield the expected results on simple examples.

1 Introduction

A recent contribution to computational learning, and to neural networks in particular, the no-free-lunch (NFL) theorems (Wolpert & Macready, 1995) give surprising insight into computational learning schemes. One implication of the NFL is that no learning algorithm performs better than random guessing over all possible learning situations. This means in particular that the widely used cross-validation (CV) methods, if successful in some cases, should fail on others. Considering the popularity of these schemes, it is of considerable interest to exhibit such a problem, where the use of CV leads to a decrease in performance. Zhu and Rohwer (1996) propose a simple setting in which a "cross-validation" method yields worse results than the maximum likelihood estimator it is based on. In the following, we extend these results to a stricter definition of cross-validation and provide an analysis and discussion of the results.

2 Experiments

The experimental setting is the following: a gaussian variable x has mean µ and unit variance. The mean should be estimated from a sample of n realizations of x. Three estimators are compared:

A(n): The mean of the n-point sample. It is the maximum likelihood and least mean squares estimator, as well as the optimal unbiased estimator.

B(n): The maximum of the n-point sample.
C(n): The estimator obtained by a cross-validation choice between A(n) and B(n).

The original "cross-validation" setting (Zhu & Rohwer, 1996) will be noted Z(n + 1); it samples one additional point and chooses the estimator A or B that is closer to this point. However, the widely used concept of cross-validation (Ripley, 1996; FAQ, 1996) corresponds to resampling and averaging estimators rather than sampling additional points. In this context, the estimator proposed in Zhu and Rohwer (1996) is closer to the split-sample, or hold-out, method. This method is known to be noisy, especially when the validation set is small, which is the case here. On the other hand, a thorough cross-validation scheme would use several validation sets resampled from the data and average over them before choosing estimator A or B. In the leave-one-out (LOO) flavor, the CV score is calculated as the average distance between each point and the estimator obtained on the rest. Note that in this setting, estimator Z operates with more information (one supplementary data point) than the LOO estimator.

The result of an experiment done with n = 16 and 10⁶ samples gives mean squared errors:

A(16): 0.0624    B(16): 3.4141    C_LOO(16): 0.0624    Z(16+1): 0.5755
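This experiment can be reproduced along the following lines; the sketch assumes the gaussian setting above, with a reduced number of samples for brevity:

```python
import numpy as np

# Compare the mean (A), the max (B), and an LOO cross-validation choice
# between them (C_LOO) on n = 16 unit-variance gaussian points.
rng = np.random.default_rng(5)
mu, n, trials = 0.0, 16, 20_000

def loo_choice(x):
    """Return whichever of mean or max has the lower LOO score."""
    idx = np.arange(len(x))
    score_a = np.mean([(x[idx != k].mean() - x[k]) ** 2 for k in range(len(x))])
    score_b = np.mean([(x[idx != k].max() - x[k]) ** 2 for k in range(len(x))])
    return x.mean() if score_a <= score_b else x.max()

errs = {"A": [], "B": [], "C_LOO": []}
for _ in range(trials):
    x = rng.normal(mu, 1.0, n)
    errs["A"].append((x.mean() - mu) ** 2)
    errs["B"].append((x.max() - mu) ** 2)
    errs["C_LOO"].append((loo_choice(x) - mu) ** 2)

for name, e in errs.items():
    print(f"{name}(16): {np.mean(e):.4f}")  # A and C_LOO near 1/16 = 0.0625
```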
In this case, it seems that the proper cross-validation procedure always picks estimator A, whose theoretical mean squared error is 1/16 = 0.0625.

3 Short Analysis

The simple setting used in these experiments allows for a full analysis of the behavior of C_LOO. Consider n data points x_i. The LOO cross-validation estimate is computed by averaging the squared error between each point and the average of the rest. Let us denote by x̄ the average of all x_i and by x̄_k the average of all x_i excluding example x_k:

x̄_k = (n x̄ − x_k) / (n − 1).

Accordingly,

(x̄_k − x_k) = [n / (n − 1)] (x̄ − x_k),

hence the final cross-validation score for estimator A:

CV = [n / (n − 1)]² S(x̄),    (3.1)

where S(w) = (1/n) Σᵢ₌₁ⁿ (w − x_i)² is the mean squared error between estimator w and the data.
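Equation 3.1 is easy to confirm numerically; a quick check assuming a small gaussian sample:

```python
import numpy as np

# The LOO score of the sample mean equals (n/(n-1))^2 * S(xbar).
rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, 16)
n = len(x)

loo = np.mean([(np.delete(x, k).mean() - x[k]) ** 2 for k in range(n)])
s_xbar = np.mean((x.mean() - x) ** 2)
assert np.isclose(loo, (n / (n - 1)) ** 2 * s_xbar)   # equation 3.1
```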
Let us now note x* = max{x_i} the maximum of the (augmented) sample, and x** = max({x_i} \ x*) the second largest element. The cross-validation score for estimator B is:

CV* = S(x*) + (1/n)(x* − x**)².    (3.2)
In order to realize how the cross-validation estimator behaves, let us first recall that x̄ is the least mean squared error estimator. As such, it minimizes S(w). Furthermore, from Huygens's formula, S(x*) = S(x̄) + (x̄ − x*)². Accordingly, we can rewrite

CV* = S(x̄) + (x̄ − x*)² + [1/(n + 1)](x* − x**)².

These observations show that for equation 3.2 to be lower than equation 3.1, an extremely unlikely situation is required: x* should be quite close to both x̄ and x**. One such situation could arise in the presence of a negative outlier. Because the mean is not robust, the outlier would produce a severe downward bias in estimator A. Thus, the maximum could very well be a better estimator than the mean in such a case. However, no such happenstance was observed in 5 × 10⁶ experiments.

4 Discussion

1. In the experiment of section 2, the LOO estimator does not use the additional data point allotted to C. When using this data point, the performance is identical. CV does not extract any information from the extra sample, but at least manages to keep the information in the original sample.

2. The cross-validation estimator does not perform better than A(16) and yields worse performance than A(17), for which the theoretical MSE is 1/17 ≈ 0.0588. However, the setting imposes a choice between A and B. The optimal choice over 10⁵ samples leads to an MSE of just 0.0617. This is a lower bound for estimator C and is significantly above that of the optimal seventeen-point estimator. On the other hand, a random choice between A and B leads to an MSE of 1.7364 on the same sample.

3. It could be objected that LOO is just one more flavor of cross-validation, so results featuring this particular estimator do not necessarily have any relevance to CV methods as a whole. Let us then compare the performance of m-fold CV on the original sixteen-point sample. It consists in dividing the sample into m validation sets and averaging the validation performance of the m estimators obtained on the remaining data. For 10⁵ samples, we get (CV_m is the m-fold CV estimator):
A
B
CV16 = CLOO
CV8
CV4
CV2
0.0624
3.4141
0.0624
0.0624
0.0624
0.0628
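The m-fold scores in the table can be computed along these lines (a sketch under the same gaussian setting; the fold-assignment details are assumptions of the example):

```python
import numpy as np

# m-fold CV scores for the mean (A) and max (B) estimators: split the
# sample into m folds, fit on the held-in data, average held-out errors.
def mfold_scores(x, m, rng):
    folds = np.array_split(rng.permutation(x), m)
    score_a = score_b = 0.0
    for i in range(m):
        held_out = folds[i]
        held_in = np.concatenate(folds[:i] + folds[i + 1:])
        score_a += np.mean((held_in.mean() - held_out) ** 2)
        score_b += np.mean((held_in.max() - held_out) ** 2)
    return score_a / m, score_b / m

x = np.random.default_rng(8).normal(0.0, 1.0, 16)
print(mfold_scores(x, 4, np.random.default_rng(9)))   # (score_A, score_B)
```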
The slight decrease in performance in CV_2 is due to the fact that we average over only two validation sets. If we resample two additional sets, the performance is identical to all other CV estimators.

4. While requiring additional computation, none of the CV estimators above gains anything on A (even with the help of one additional point). Better performance can be observed, however, with a different choice of estimators. Consider, for example, D(n), the median of the sample, and E(n), the average between the minimum and the maximum. Using 10⁵ samples, we compare the CV estimator to D and E calculated on sixteen-point samples:

D(16): 0.0949    E(16): 0.1529    C_LOO(16): 0.0931
Here cross-validation outperforms both estimators it is based on.

5 Conclusion

The NFL results imply that for every situation where a learning method performs better than random guessing, another situation exists where it performs correspondingly worse. Numerous reports of successful applications of usual learning procedures suggest that the "unspecified prior" under which they outperform random guessing holds in a number of practical cases. Exhibiting these assumptions is of importance in order to check whether the conditions for success hold when tackling a new problem. However, it is unlikely that such a simple setting could challenge these yet unknown assumptions.

Cross-validation has many drawbacks, and it is far from being the most efficient learning method (even among non-Bayesian frameworks). In that simple case, though, it provides perfectly decent results. We now know that there is no free lunch for cross-validation. However, the task of exhibiting an easily understandable, nondegenerate case where it fails has yet to be completed. Furthermore, the task of exhibiting the hidden prior under which cross-validation is beneficial provides challenging prospects for the future.

Acknowledgments

This work was partially supported by a research fellowship from the Technical University of Denmark, and partially performed at LAFORIA (URA 1095), Université Pierre et Marie Curie, Paris. I am grateful to the referees for constructive criticism and to Huaiyu Zhu and Lars Kai Hansen for challenging remarks on drafts of the article.
References

FAQ. (1996). Neural networks frequently asked questions (part 3 of the FAQ in comp.ai.neural-nets). ftp://ftp.sas.com/pub/neural/FAQ3.html.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Wolpert, D. H., & Macready, W. G. (1995). The mathematics of search (Tech. Rep. No. SFI-TR-95-02-010). Santa Fe: Santa Fe Institute.
Zhu, H., & Rohwer, R. (1996). No free lunch for cross validation. Neural Computation, 8(7), 1421–1426.
Received October 9, 1996; accepted February 4, 1997.
LETTERS
Communicated by Paul Bush
Gamma Oscillation Model Predicts Intensity Coding by Phase Rather than Frequency

Roger D. Traub
IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598, and Department of Neurology, Columbia University, New York, NY 10032, U.S.A.
Miles A. Whittington Department of Physiology, Imperial College School of Medicine at St. Mary’s, London W2 1PG, U.K.
John G. R. Jefferys Department of Physiology, University of Birmingham School of Medicine, Birmingham B15 2TT, U.K.
Gamma-frequency electroencephalogram oscillations may be important for cognitive processes such as feature binding. Gamma oscillations occur in hippocampus in vivo during the theta state, following physiological sharp waves, and after seizures, and they can be evoked in vitro by tetanic stimulation. In neocortex, gamma oscillations occur under conditions of sensory stimulation as well as during sleep. After tetanic or sensory stimulation, oscillations in regions separated by several millimeters or more occur at the same frequency, but with phase lags ranging from less than 1 ms to 10 ms, depending on the conditions of stimulation. We have constructed a distributed network model of pyramidal cells and interneurons, based on a variety of experiments, that accounts for near-zero phase lag synchrony of oscillations over long distances (with axon conduction delays totaling 16 ms or more). Here we show that this same model can also account for fixed positive phase lags between nearby cell groups coexisting with near-zero phase lags between separated cell groups, a phenomenon known to occur in visual cortex. The model achieves this because interneurons fire spike doublets and triplets that have average zero phase difference throughout the network; this provides a temporal framework on which pyramidal cell phase lags can be superimposed, the lag depending on how strongly the pyramidal cells are excited.

Neural Computation 9, 1251–1264 (1997)
© 1997 Massachusetts Institute of Technology

1 Introduction

Gamma-frequency electroencephalogram (EEG) oscillations occur in cortical (and diencephalic) structures during a variety of behavioral states and stimulus paradigms (reviewed by Eckhorn, 1994; Gray, 1994; Singer & Gray, 1995).
Examples for the hippocampus include the theta state (Soltesz & Deschênes, 1993; Bragin, Jandó, Nádasdy, Hetke, & Wise, 1995; Sik, Penttonen, Ylinen, & Buzsáki, 1995), during which gamma activity is continuous; following limbic seizures (Leung, 1987); transient (few hundred ms) gamma following synchronized pyramidal cell firing, such as physiological sharp waves in vivo (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996); epileptiform bursts, either in vivo or in vitro (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996); and following brief tetanic stimulation in vitro (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Stanford, & Jefferys, 1996). Situations where gamma-frequency oscillations occur in the neocortex include sleep—both slow-wave sleep, in which gamma is localized to a spatial scale of 1–2 mm (Steriade, Amzica, & Contreras, 1995; Steriade, Contreras, Amzica, & Timofeev, 1996), and rapid eye movement sleep, in which gamma is widespread (Llinás & Ribary, 1993); during complex motor performance in behaving monkeys (Murthy & Fetz, 1992); following auditory stimulation (Joliot, Ribary, & Llinás, 1994); and following visual stimulation (Gray & Singer, 1989; Frien, Eckhorn, Bauer, Woelbern, & Kehr, 1994). Sensory-evoked gamma oscillations have also been observed in the olfactory bulb (Adrian, 1950) and olfactory cortex (Freeman, 1959). The variety of states associated with gamma oscillations makes it unlikely that gamma serves a simple unified function. Somewhat along the lines of other authors (Singer & Gray, 1995), we shall adopt the following working hypothesis: gamma-frequency oscillations that occur sufficiently locally may perhaps be epiphenomenal, but gamma-frequency oscillations that are synchronized (or at least phase locked) over longer distances (say, several millimeters) act to coordinate pools of neurons devoted to a common task. This hypothesis is especially attractive in the visual system, because stimulation with a long bar elicits oscillations that are tightly synchronized (phase lags can be less than 1 ms) across distances of several millimeters in cortical sites specific for receptive field and orientation preference (Gray, König, Engel, & Singer, 1989). If the hypothesis is correct, it becomes important to work out the cellular mechanisms of gamma oscillations. This could provide experimental tools for manipulating the oscillation and insight into the usefulness for coding of phase versus frequency of oscillations in neuronal populations. With respect to this latter issue, König, Engel, Roelfsema, and Singer (1995) have shown that visual stimulus properties are not encoded solely by precisely synchronized oscillation of selected neuronal populations. Rather, certain optimally driven populations oscillate with near-zero phase difference, even though these populations are spatially separated, whereas other neuronal populations—near to the index population but less optimally driven (by virtue of off-angle orientation tuning)—oscillate with phase delays of up to 4.5 ms. This observation suggests that oscillation phase, but not frequency, could be used to encode stimulus features. In this article, we present a testable cellular mechanism for the
observation of König et al. (1995), based on an experimental and computer model of gamma oscillations. This model derives from hippocampal and neocortical slice studies and has the following basic features:

1. Populations of synaptically connected GABAergic interneurons can generate synchronized gamma-frequency oscillations, without a requirement for ionotropic glutamate input from pyramidal cells, provided the interneurons are tonically excited (Whittington et al., 1995). This can be shown experimentally in hippocampal or neocortical slices, during pharmacological blockade of AMPA (α-amino-3-hydroxy-5-methyl-4-isoxazole propionic acid) and NMDA (N-methyl-D-aspartate) receptors, provided there is stimulation of the interneurons with glutamate itself or with the metabotropic glutamate agonist (1S,3R)-ACPD ((1S,3R)-1-aminocyclopentane-1,3-dicarboxylic acid). The physical mechanism by which this synchrony obtains is similar to that described by Wang and Rinzel (1993) for synchronization of GABAergic nucleus reticularis thalami neurons, at near-zero phase lag, by mutual inhibition; it is possible when the inhibitory postsynaptic conductance (IPSC) time course is slow compared to the intrinsic frequency of the uncoupled cells ("intrinsic frequency" as used here means the frequency at which the metabotropically excited cells would fire when synaptically uncoupled). (See also van Vreeswijk, Abbott, & Ermentrout, 1994.)

2. The synchronized firing of a population of pyramidal cells can elicit gamma-frequency oscillations in networks of interneurons, probably in part via a slow, metabotropic excitatory postsynaptic potential (EPSP) (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996).

3. Gamma-frequency network oscillations can occur simultaneously with the firing of pyramidal cells, for example, after tetanic stimulation in vitro (Traub, Whittington, Stanford, & Jefferys, 1996).

4. Long-range (at least 4 mm) synchrony of gamma-frequency network oscillations can occur with near-zero phase lags, despite axon conduction delays. This was predicted to occur, based on simulations, when interneurons fire spike doublets, the second spike being elicited by recurrent excitation from both nearby and more distant pyramidal cells. Experiments in the hippocampal slice then showed that interneurons do indeed fire doublets when oscillations occur at two distant sites with near-zero phase lag but fire singlets when oscillations only occur locally (Traub, Whittington, Stanford, & Jefferys, 1996). Also as predicted by this model, the first action potential in the interneuron doublet occurs simultaneously with the pyramidal cell action potential (Traub, Whittington, Stanford, & Jefferys, 1996; Whittington, Stanford, Colling, Jefferys, & Traub, in press). (The model explicated by Bush and Sejnowski (1996) produces tight synchrony between two
columns—also separated by conduction delays—without interneuron doublets but with interneurons firing short bursts. It is not known, however, whether longer-range synchrony can be produced by this model. It is possible that bursts of action potentials, fired by interneurons in the model of Bush and Sejnowski, have a physically similar effect to the doublets and triplets in our model and to the doublets in hippocampal experiments.)

5. In local networks (those without axon conduction delays) that generate gamma-frequency oscillations, mean excitation of pyramidal cells is encoded by firing phase relative to the population oscillation. This is a model prediction, not yet verified experimentally (Traub, Jefferys, & Whittington, in press).

We shall demonstrate how, in a network of appropriate topology, the above properties of gamma oscillations can combine so that distant groups of cells oscillate in phase (when both groups are driven strongly), while at the same time, weakly driven cell groups oscillate at the same frequency but with a phase lag, thereby replicating the result of König et al. (1995).

2 Methods

The approach to the network modeling is similar to that described elsewhere (Traub et al., in press; Traub, Whittington, Stanford, & Jefferys, 1996). Briefly, pyramidal cells (e-cells) are simulated as in Traub et al. (1994), and GABAergic interneurons (i-cells) as in Traub and Miles (1995). Pyramidal cells, while capable of intrinsic bursting, are driven by injection of sufficiently large currents to force them into a repetitive firing mode (Wong & Prince, 1981), consistent with observations on actual pyramidal cell firing patterns after tetanic stimulation in vitro (Whittington, Stanford, Colling, Jefferys, & Traub, in press). Synaptic interactions in the model are:

1. e → i, onto interneuron dendrites, with unitary AMPA receptor EPSP large enough to cause the i-cell to fire (Gulyás et al., 1993). The unitary AMPA receptor conductance is 8·t·e^(−t) nS, with t in ms. In addition, pyramidal cell firing elicits NMDA receptor-mediated EPSPs in interneurons, with kinetics and saturation as described previously (Traub et al., in press). Most likely, the dominant relevant EPSPs for gamma oscillations are metabotropic glutamate, but as long as the EPSPs are slow enough (and a relaxation time constant of 60 ms works), there is little functional significance of this distinction for the model.

2. i → e, onto pyramidal cell somata and proximal dendrites, as would be made by basket cells. The unitary IPSCs are GABA_A receptor-mediated, with abrupt onset to 2 nS, then exponential decay with time constant 10 ms (Whittington, Jefferys, & Traub, 1996).
3. i → i, onto interneuron dendrites. These also are GABA_A receptor-mediated, with abrupt onset to 0.75 nS, then exponential decay with time constant 10 ms (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996).

4. e → e synaptic interactions are omitted for the sake of simplicity. Simulation of both local and distributed networks suggests that gamma oscillation properties persist when e → e connections are present, provided the conductances are not too large. In particular, recurrent excitation must be weak enough not to evoke dendritic calcium spikes and cellular bursts (Traub et al., in press). Metabotropic glutamate receptor activation, of likely importance for gamma oscillations, is expected to reduce excitatory synaptic conductances (Baskys & Malenka, 1991; Gereau & Conn, 1995), which might favor the stability of the oscillation. If recurrent excitation occurs in dendritic regions that do not develop full calcium spikes, stability of the oscillation is also favored (Traub, unpublished data).

All three types of synaptic interaction (e → i, i → e, i → i) take place within columns and between columns. As before, pyramidal cells and interneurons are organized into groups (columns), in the present case each having eight pyramidal cells and eight interneurons. The global organization is shown in Figure 1. Such an organization allows stimulation parameters and axon conduction delays to be specified readily. In addition, the network has a form that assumes functional significance. For example, each column might correspond to a visual stimulus orientation/receptive field. If the global visual stimulus is a long horizontal bar, it would excite columns A1 and B1 maximally, and the other columns less; we simulate this by using maximal driving currents to the cells of A1 and B1, with smaller driving currents to the other columns. Hence, current to A1,B1 > A2,B2 > A3,B3 > A4,B4. Specific details are as follows. Pyramidal cells and interneurons receive synaptic input from all pyramidal cells and all interneurons in the same column and in connected columns (see Figure 1). In the example to be illustrated, the average driving current to e-cells in columns A1 and B1 was 2.92 and 3.12 nA, respectively; to columns A2 and B2, 2.36 and 2.52 nA; to A3 and B3, 1.8 and 1.92 nA; to A4 and B4, 1.18 and 1.26 nA. (Thus, the current ratios for columns A1:A2:A3:A4, respectively B1:B2:B3:B4, are 1:0.8:0.6:0.4.) For the example simulation illustrated below, axon conduction delays were less than 0.5 ms within a column, 2 ms for distinct columns within a hypercolumn, and 4 ms between hypercolumns. Qualitatively similar results were obtained with different delays, including 3 ms between distinct columns in a hypercolumn and 6 ms between hypercolumns; and, respectively, 2.5 and 7.5 ms. Also, similar results were obtained with different distributions of driving currents. For example, the following current ratios were also used: 1:x:x:x (where x could be 0.2, 0.5, 0.6, or 0.7); 1:0.8:0.7:0.6; 1:0.7:0.6:0.5; 1:0.6:0.5:0.4; 1:0.6:0.4:0.2. Hence, no particular parameter tuning was necessary to produce the qualitative behavior described below. A rough sketch of this model setup follows.
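A minimal sketch of the setup just described (our illustrative Python, not the authors' FORTRAN code; the between-column connection arrows of Figure 1 are not reproduced, and the kernels below are simplified stand-ins for the conductance time courses given above):

import numpy as np

# Columns A1..A4 and B1..B4, each with 8 pyramidal cells (e) and 8 interneurons (i).
columns = [h + str(n) for h in "AB" for n in range(1, 5)]

def conduction_delay_ms(c1, c2):
    """Axon conduction delays used in the example simulation."""
    if c1 == c2:
        return 0.5               # "less than 0.5 ms" within a column
    if c1[0] == c2[0]:
        return 2.0               # distinct columns within a hypercolumn
    return 4.0                   # between hypercolumns

# Driving-current ratios 1:0.8:0.6:0.4 across columns 1..4, scaled by the
# mean e-cell currents reported for A1 (2.92 nA) and B1 (3.12 nA).
ratios = {1: 1.0, 2: 0.8, 3: 0.6, 4: 0.4}
base_nA = {"A": 2.92, "B": 3.12}
drive_nA = {c: base_nA[c[0]] * ratios[int(c[1])] for c in columns}

# Unitary synaptic conductances (t in ms after the presynaptic spike).
def g_ampa_e_to_i(t):
    return 8.0 * t * np.exp(-t)           # nS; peaks near t = 1 ms

def g_gaba(t, g_peak_nS):
    return g_peak_nS * np.exp(-t / 10.0)  # abrupt onset, 10 ms decay

g_i_to_e = lambda t: g_gaba(t, 2.0)       # basket-cell IPSC onto e-cells
g_i_to_i = lambda t: g_gaba(t, 0.75)      # IPSC onto interneuron dendrites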
Figure 1: Structure of the model. The network consists of two “hypercolumns,” A and B, each consisting of four columns. A hypercolumn can be thought of as corresponding to a piece of receptive field and a column as corresponding to a piece of receptive field with a certain orientation. Structurally, a column has eight pyramidal cells and eight GABAergic interneurons, with between-column connections as indicated by the arrows. When two columns are connected by an arrow (or a column to itself), it means that each cell (pyramidal or interneuron) in one column is connected to each cell (pyramidal or interneuron) in the other. Axon conduction delays are less than 0.5 ms within a column and, in later figures, 2 ms between columns in the same hypercolumn and 4 ms between hypercolumns.
A total of 29 simulations provided background data for this study, along with over 200 simulations of models with different distributed network topology. Programs were written in FORTRAN augmented with special instructions for IBM SP1 and SP2 parallel computers. Simulation of 1500 ms of neural activity took 3.6 hours on the SP2 and 5.2 hours on the SP1, both using eight nodes. This study thus used 835 SP2 node-hours. (For further details on the simulations, send e-mail to [email protected].)

3 Results

We shall illustrate in detail the results of one particular simulation. In this simulation, the connected columns A1 and B1, while in distinct hypercolumns and separated by 4 ms axon-conduction delays, are both driven maximally (and nearly equally); all other columns are driven less (but not uniformly less). This is the situation that might be expected to occur in visual cortex following stimulation with a long bar.
Figure 2: Behavior of two maximally stimulated columns in different hypercolumns, separated by 4 ms axon conduction delay. The average of four pyramidal cell voltages (respectively, interneuron voltages) is plotted for cells from hypercolumn A, column 1 (A1) and hypercolumn B, column 1 (B1). The population oscillation (35 Hz) is apparent. Pyramidal cell firing in A1 may lag firing in B1 (−) or lead it (+). Interneurons fire synchronized doublets and triplets. (The peaks in these signals that are less than full action potential height result from averaging a small number of voltage signals, in which action potentials are not perfectly synchronized.)
The behavior of the system can be summarized as follows:

1. All columns oscillate at the same frequency (35 Hz).

2. Pyramidal cell firing in the maximally stimulated columns, A1 and B1, occurs with some relative jitter (see Figure 2), but the mean phase lag in the cross-correlation is less than 1 ms (Figure 4). This occurs despite the axon conduction delay of 4 ms (or up to 7.5 ms in other simulations, not shown) between hypercolumns A and B.

3. In contrast, pyramidal cells in nearby (but differently excited) columns A1 and A3 fire consistently out of phase, with maximally driven A1 always leading A3 (see Figure 3), and a mean phase lag of 2.8 ms (see Figure 4). This takes place even though the conduction delay is only 2 ms between A1 and A3.

4. Interneurons fire in synchronized doublets and triplets (see Figures 2 and 3), in a pattern similar to that described previously (Traub, Whittington, Stanford, & Jefferys, 1996).
Figure 3: Behavior of two nearby columns, separated by 2 ms axon conduction delay, one column stimulated maximally and the other not. The average of four pyramidal cell voltages (respectively, interneuron voltages) is plotted for cells from hypercolumn A, column 1 (A1) and hypercolumn A, column 3 (A3). The pyramidal cells in A3 receive 0.6 times the mean driving current of pyramidal cells in A1. Pyramidal cell firing in A1 consistently leads pyramidal cell firing in A3 (+).
5. The interneuron cross-correlations are, at least approximately, distributed symmetrically about zero (see Figure 5).

4 Discussion

Our model replicates an interesting experimental observation on stimulus-evoked gamma-frequency oscillations in the visual cortex: distant (at least 7 mm experimentally), but strongly stimulated, columns can oscillate with near-zero phase lag, while nearby, but unequally stimulated, columns can oscillate at the same frequency, but with the weakly stimulated column lagging the strongly stimulated one. Thus, oscillation phase would encode one aspect of how strongly neurons are driven by a nonlocal stimulus (Hopfield, 1995). The essential idea is this: in local networks, current drive to pyramidal cells alters phase relative to the mean oscillation; in distributed networks with uniform stimulation, phase delays can be small despite long axon conduction delays; and, at least with a simple network topology (and without recurrent excitation), these two principles can work together. In order for the replication of data to be useful, the model must meet two
Figure 4: Auto- and cross-correlations of mean pyramidal cell voltages. Auto- and cross-correlations were performed on 750 ms of data, the average voltage of four pyramidal cells from each of three columns (A1, A3, and B1). Each column oscillates at 35 Hz. Columns A1 and B1, both maximally driven but also with maximum respective axon conduction delay, oscillate with 0.9 ms mean phase difference. Column A1 leads column A3 by an average of 2.8 ms. Columns A1 and A3 have only 2 ms mutual axon conduction delay, but A3 receives 0.6 times the drive of A1. (The auto- and cross-correlations appear so smooth because it is intracellular voltage signals, including subthreshold voltages, that are being correlated, rather than extracellular unit firing signals, as might be used with in vivo data.)
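The lag measurement summarized in this caption can be illustrated as follows (a rough sketch on synthetic 35 Hz traces, not the authors' analysis code; with real data the inputs would be the averaged intracellular voltages):

import numpy as np

def peak_lag_ms(v1, v2, dt_ms):
    """Lag (ms) at which the normalized cross-correlation of two traces
    peaks; positive values mean v1 leads v2."""
    v1 = v1 - v1.mean()
    v2 = v2 - v2.mean()
    xc = np.correlate(v2, v1, mode="full")
    xc /= np.sqrt(np.dot(v1, v1) * np.dot(v2, v2))
    lags = np.arange(-(len(v1) - 1), len(v2)) * dt_ms
    return lags[np.argmax(xc)]

# Example: 750 ms of 35 Hz activity, with the second trace delayed by 2.8 ms.
dt = 0.1                                    # ms per sample
t = np.arange(0.0, 750.0, dt)
a1 = np.sin(2 * np.pi * 35.0 * t / 1000.0)
a3 = np.sin(2 * np.pi * 35.0 * (t - 2.8) / 1000.0)
print(peak_lag_ms(a1, a3, dt))              # approx. +2.8: a1 leads a3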
requirements: it must rest on a reasonable experimental basis, and it must make experimental predictions. Selected, but critical, elements of the network model are supported by the following data:

1. The subnetwork consisting of a local group of interneurons, synaptically interconnected, correctly predicts the dependence of oscillation frequency on GABA_A conductance and IPSC time constant; it also correctly predicts the breakup of the oscillation as a sufficiently low frequency is reached (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; Wang & Buzsáki, 1996).

2. The model correctly predicted the existence of a "tail" of gamma-frequency oscillations following the synchronous firing of pyramidal cells (Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996).
Figure 5: Auto- and cross-correlations of mean interneuron voltages. Auto- and cross-correlations were performed on 750 ms of data, the average voltage of four interneurons from each of three columns (A1, A3, and B1). Synchronized multiplet firing shows up as the secondary peaks in the autocorrelations at ±4 to 5 ms. The A1/A3 cross-correlation has a trough near 0 ms, with peaks at +1.7 and −2.4 ms.
3. The model correctly predicted several nonintuitive features of collective gamma oscillations in the hippocampal slice: oscillations coherent over long distances (up to 4 mm), but not oscillations occurring only locally, are associated with firing of interneuron doublets (intradoublet interval usually about 4–5 ms) at gamma frequency; the first interneuron spike of the doublet is synchronized with the pyramidal cell action potential; population oscillations slow in frequency as larger neuronal ensembles are entrained (Traub, Whittington, Stanford, & Jefferys, 1996; Whittington, Stanford, Colling, Jefferys, & Traub, in press).

4. Experiments indicate that tetanic stimuli of different amplitudes, applied to different subregions of CA1 (in the hippocampal slice), lead to simultaneous oscillations in the respective subregions, of common frequency but different phase (Whittington, Stanford, Colling, Jefferys, & Traub, in press).

These data provide confidence that certain structural underpinnings of the gamma oscillation model are accurate. Of course, it is true that most
of the experimental data supporting this model derive from the hippocampal CA1 region, wherein recurrent excitation is sparse, while we would like to draw conclusions about the neocortex, wherein recurrent excitation is dense. Further experiments and simulations are clearly necessary. The experimental analysis of recurrent excitation may prove difficult. For example, nonspecific blockade of AMPA receptors will block AMPA receptors on interneurons, which is predicted to disrupt long-range synchronization (Traub, Whittington, Stanford, & Jefferys, 1996). While a selective blocker for interneuron AMPA receptors exists (Iino, Koike, Isa, & Ozawa, 1996), what is required instead is a blocker specific for pyramidal cell AMPA receptors. An alternative approach, used here, is to generate testable predictions in a model that simply omits (for now) e → e connections. What is different about the model explained here, as compared with the model in previous publications, is the more complex topological structure, which attempts to capture two connected "hypercolumns" rather than simply a one-dimensional chain of cell groups, and the use of different axon conduction delays for nearby columns, as opposed to more distant ones. One result of this additional complexity is the more complicated firing patterns of the interneurons in this, as compared with earlier, network models (Traub et al., in press; Traub, Whittington, Stanford, & Jefferys, 1996): triplets in the present case versus predominantly doublets before. This feature may provide one critical experimental test of this model; if interneurons exist that connect between cortical columns and hypercolumns, possibly layer III large basket cells (Kisvárday, 1992; Kisvárday, Beaulieu, & Eysel, 1993; Kisvárday & Eysel, 1993), then these interneurons should fire in short bursts during visual stimulation with large objects, objects that evoke gamma oscillations synchronized over many millimeters. (See also Bush & Sejnowski, 1996.) Furthermore, on average, interneuron bursts should be synchronized with one another.
Acknowledgments

This work was supported by IBM and the Wellcome Trust. We are grateful for helpful discussions to Hannah Monyer, Rodolfo Llinás, Charles Gray, Wolf Singer, György Buzsáki, and Nancy Kopell.
References

Adrian, E. D. (1950). The electrical activity of the mammalian olfactory bulb. Electroenceph. Clin. Neurophysiol., 2, 377–388.
Baskys, A., & Malenka, R. C. (1991). Agonists at metabotropic glutamate receptors presynaptically inhibit EPSCs in neonatal rat hippocampus. J. Physiol., 444, 687–701.
Bragin, A., Jandó, G., Nádasdy, Z., Hetke, J., Wise, K., & Buzsáki, G. (1995). Gamma (40–100 Hz) oscillation in the hippocampus of the behaving rat. J. Neurosci., 15, 47–60.
Bush, P. C., & Sejnowski, T. J. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comput. Neurosci., 3, 91–110.
Eckhorn, R. (1994). Oscillatory and non-oscillatory synchronizations in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res., 102, 405–426.
Freeman, W. J. (1959). Distribution in space and time of prepyriform electrical activity. J. Neurophysiol., 22, 644–666.
Frien, A., Eckhorn, R., Bauer, R., Woelbern, T., & Kehr, H. (1994). Stimulus-specific fast oscillations at zero phase between visual areas V1 and V2 of awake monkey. NeuroReport, 5, 2273–2277.
Gereau, R. W., IV, & Conn, P. J. (1995). Multiple presynaptic metabotropic glutamate receptors modulate excitatory and inhibitory synaptic transmission in hippocampal area CA1. J. Neurosci., 15, 6879–6889.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comput. Neurosci., 1, 11–38.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Gulyás, A. I., Miles, R., Sik, A., Tóth, K., Tamamaki, N., & Freund, T. F. (1993). Hippocampal pyramidal cells excite inhibitory neurons through a single release site. Nature, 366, 683–687.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Iino, M., Koike, M., Isa, T., & Ozawa, S. (1996). Voltage-dependent blockage of Ca2+-permeable AMPA receptors by joro spider toxin in cultured rat hippocampal neurones. J. Physiol., 496, 431–437.
Joliot, M., Ribary, U., & Llinás, R. (1994). Human oscillatory brain activity near 40 Hz coexists with cognitive temporal binding. Proc. Natl. Acad. Sci. USA, 91, 11748–11751.
Kisvárday, Z. F. (1992). GABAergic networks of basket cells in the visual cortex. Prog. Brain Res., 90, 385–405.
Kisvárday, Z. F., Beaulieu, C., & Eysel, U. T. (1993). Network of GABAergic large basket cells in cat visual cortex (area 18): Implication for lateral disinhibition. J. Compar. Neurol., 327, 398–415.
Kisvárday, Z. F., & Eysel, U. T. (1993). Functional and structural topography of horizontal inhibitory connections in cat visual cortex. Eur. J. Neurosci., 5, 1558–1572.
König, P., Engel, A. K., Roelfsema, P. R., & Singer, W. (1995). How precise is neuronal synchronization? Neural Comp., 7, 469–485.
Leung, L. S. (1987). Hippocampal electrical activity following local tetanization. I. Afterdischarges. Brain Res., 419, 173–187.
Llinás, R., & Ribary, U. (1993). Coherent 40-Hz oscillation characterizes dream state in humans. Proc. Natl. Acad. Sci. USA, 90, 2078–2081.
Murthy, V. N., & Fetz, E. E. (1992). Coherent 25- to 35-Hz oscillations in the sensorimotor cortex of awake behaving monkeys. Proc. Natl. Acad. Sci. USA, 89, 5670–5674.
Sik, A., Penttonen, M., Ylinen, A., & Buzsáki, G. (1995). Hippocampal CA1 interneurons: An in vivo intracellular labeling study. J. Neurosci., 15, 6651–6665.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Rev. Neurosci., 18, 555–586.
Soltesz, I., & Deschênes, M. (1993). Low- and high-frequency membrane potential oscillations during theta activity in CA1 and CA3 pyramidal neurons of the rat hippocampus under ketamine-xylazine anesthesia. J. Neurophysiol., 70, 97–116.
Steriade, M., Amzica, F., & Contreras, D. (1995). Synchronization of fast (30–40 Hz) spontaneous cortical rhythms during brain activation. J. Neurosci., 16, 392–417.
Steriade, M., Contreras, D., Amzica, F., & Timofeev, I. (1996). Synchronization of fast (30–40 Hz) spontaneous oscillations in intrathalamic and thalamocortical networks. J. Neurosci., 16, 2788–2808.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (in press). Simulation of gamma rhythms in networks of interneurons and pyramidal cells. J. Comput. Neurosci.
Traub, R. D., & Miles, R. (1995). Pyramidal cell-to-inhibitory cell spike transduction explicable by active dendritic conductances in inhibitory cell. J. Comput. Neurosci., 2, 291–298.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Traub, R. D., Whittington, M. A., Stanford, I. M., & Jefferys, J. G. R. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Wang, X.-J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalami nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.
Whittington, M. A., Stanford, I. M., Colling, S. B., Jefferys, J. G. R., & Traub, R. D. (in press). Spatiotemporal patterns of gamma-frequency oscillations tetanically induced in the rat hippocampal slice. J. Physiol.
Whittington, M. A., Jefferys, J. G. R., & Traub, R. D. (1996). Effects of intravenous anaesthetic agents on fast inhibitory oscillations in the rat hippocampus in vitro. Br. J. Pharmacol., 118, 1977–1986.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Wong, R. K. S., & Prince, D. A. (1981). Afterpotential generation in hippocampal pyramidal cells. J. Neurophysiol., 45, 86–97.
Received September 27, 1996; accepted January 22, 1997.
Communicated by William Skaggs
Multiunit Normalized Cross Correlation Differs from the Average Single-Unit Normalized Correlation

Purvis Bedenbaugh
Department of Otolaryngology and Keck Center for Integrative Neuroscience, University of California at San Francisco, San Francisco, CA 94143, U.S.A.
George L. Gerstein Department of Neuroscience, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
As the technology for simultaneously recording from many brain locations becomes more available, more and more laboratories are measuring the cross-correlation between single-neuron spike trains, and between composite spike trains derived from several undiscriminated cells recorded on a single electrode (multiunit clusters). The relationship between single-unit correlations and multiunit cluster correlations has not yet been fully explored. We calculated the normalized cross-correlation (NCC) between single-unit spike trains and between small clusters of units recorded in the rat somatosensory cortex. The NCC between small clusters of units was larger than the NCC between single units. To understand this result, we investigated the scaling of the NCC with the number of units in a cluster. Multiunit cross-correlation can be a more sensitive detector of neuronal relationship than single-unit cross-correlation. However, changes in multiunit cross-correlation are difficult to interpret uniquely because they depend on the number of cells recorded on each electrode and because they can arise from changes in the correlation between cells recorded on a single electrode or from changes in the correlation between cells recorded on two electrodes.

Neural Computation 9, 1265–1275 (1997)
© 1997 Massachusetts Institute of Technology

1 Introduction

One of the great challenges of contemporary neuroscience is to understand the distributed patterns of brain activity associated with learned behaviors and remembered stimuli. Increasing evidence supports the view that assemblies of neurons (Hebb, 1949; Gerstein, Bedenbaugh, & Aertsen, 1989) subserving particular percepts and behaviors are embedded in a nonlinear, cooperative network. The collective activity of the cells within a neuronal assembly provides a context for interpreting the activity of each of its component cells. The experimental detection of such neuronal assemblies and study of their properties depends on simultaneously recording from many neurons and on measurement of timing interrelationships among the several neuronal activities.
Analysis of the interactions between neuronal assemblies is even more demanding of recording technology and of intensive calculation. For over thirty years, the cross-correlation between two neuronal spike trains (Perkel, Gerstein, & Moore, 1967b), its generalizations (Gerstein & Perkel, 1972; Aertsen, Gerstein, Habib, & Palm, 1989), and its homolog in the frequency domain (Perkel, 1970; Rosenberg, Amjad, Breeze, Brillinger, & Halliday, 1989) have been the principal tools for assessing neuronal relationships. Methods have been developed to normalize the cross-correlation so that the resulting correlation coefficient may be made independent both of firing rate and of modulations in firing rate (Aertsen et al., 1989; Palm, Aertsen, & Gerstein, 1988) and may be fairly compared across different observations. When these methods are applied to single-unit recordings, correlating the spikes of two simultaneously recorded single neurons, the results are unambiguous. It is clear which particular neurons are correlated; changes in correlation are due to a change in the coordination of particular cells, not to changes in recruitment or to poor recording stability. There are, however, difficulties in using spike train pair correlation to study neural assembly properties. Single-unit cross-correlation is a relatively insensitive measure of neuronal relationship (Bedenbaugh, 1993). Given the low firing rates of most cortical neurons, this makes it difficult to detect a relationship between simultaneously recorded spike trains within a reasonably short recording time. In turn, this translates into fewer data points per experiment, a major limitation for investigations that seek to explore the distributed pattern of neuronal responses and their coordination. The development of spike-shape sorting technology has partially improved the situation with respect to sampling and efficiency problems, since it allows separation of multiunit observations from a single electrode into spike trains that originate from different neurons. However, the price is a complex technology, and also a combinatorial explosion of pairs as the number of separable simultaneously recorded neurons increases. The latter problem has been partially mitigated, although at the cost of intensive computation (Gerstein, Perkel, & Dayhoff, 1985; Gerstein & Aertsen, 1985). In reaction to these difficulties, a number of investigators have begun to use multiunit recordings without separation of waveforms; that is, they use a composite data stream consisting of several superimposed spike trains from each electrode. The rationale is that cells in localized cortical columns have homogeneous response properties. Changes in the firing rate of a multiunit recording often reflect changes in the activity of the individual neurons in the population. If we assume that possible confounding factors, such as recruitment of additional neurons or recording instability, can be minimized, it is reasonable and increasingly common to study neuronal assembly properties by cross-correlation of multiunit recordings.
We might expect that the normalized cross-correlation (NCC) between two multiunit spike trains would correspond to the average cross-correlation between the single units recorded on different electrodes. However, for our recordings in rat somatosensory cortex, we found that the multiunit normalized cross-correlation was larger than the single-unit normalized cross-correlation. The analysis in this article suggests that this may be because the correlation between units recorded on a single electrode influences the multiunit cross-correlation. This view is also consistent with Kreiter and Singer's (1995) recent report of dependence of multiunit cross-correlation on the local correlation between the single-unit spike trains that comprise the multiunit spike trains.

2 Methods

2.1 Experimental. Extracellular neuronal potentials were recorded from the forepaw region of the somatosensory cortex of seven rats using either three or five tungsten microelectrodes and conventional methods, described elsewhere (Bedenbaugh, 1993; Bedenbaugh, Gerstein, & Bolton, 1997). Recordings from each of the several electrodes used in each experiment were separated into a total of eighty-nine single-unit spike trains with 0.5 millisecond time resolution using hardware principal component spike sorters (Gerstein, Bloom, Espinosa, Evanczuk, & Turner, 1983). Of these, 638 pairs of single-unit trains had been simultaneously recorded and were available for cross-correlation analysis. Finally, a total of twenty-one multiunit spike trains were created by merging of the sorted single-unit spike trains. Each cell pair was recorded under four different somatic stimulation conditions: no stimulus (spontaneous activity), touch to the best receptive field of the cells recorded by one electrode, touch to the best receptive field of the cells recorded by another electrode, and a simultaneous touch to both skin sites (Bedenbaugh, 1993; Bedenbaugh et al., 1997). The data were blocked into a sequence of stimulus-dependent trials, and peri-stimulus-time (PST) histograms and raw cross-correlograms were computed for each stimulus condition. The normalized cross-correlation (correlation coefficient) histogram for a pair of spike trains was calculated by subtracting the cross-correlation of the two PST histograms divided by the number of stimuli (PST predictor) (Perkel et al., 1967a) from the raw cross-correlogram, then normalizing the result by the geometric mean variance of the two point processes (Bedenbaugh, 1993; Bedenbaugh et al., 1997).

2.2 Analytical. Consider the situation illustrated schematically in Figure 1. There are two electrodes, A and B, each recording from m cells so that the spike train recorded on electrode A is A(t) = a_1(t) + a_2(t) + · · · + a_m(t), and the spike train recorded on electrode B is B(t) = b_1(t) + b_2(t) + · · · + b_m(t), where t is time, and m = 2 in the figure. Discretize time into N_T total bins of equal duration, short enough that any one cell may fire no more than one
spike in one bin-width of time. If we assign the value 1 to the occurrence of a spike, and 0 to the absence of a spike in a bin, then the expected number of spikes per unit time of any individual cell, say a_i, is N_i/N_T, where N_i = Σ_{k=1}^{N_T} a_{ik}, where k denotes a particular time bin. Since the point process derived from the spike train can only take the values 0 or 1, the expectation (mean value) E[a_{ik}²] is also N_i/N_T. Further, assuming that all of the individual spike trains contain the same number of spikes and that the number of times any two cells recorded on a single electrode both fire in one time bin is N_close, we see then that the mean and variance of a multiunit process, say A, are

μ_A = E[A_k] = Σ_{i=1}^{m} E[a_{ik}] = m N_i/N_T

σ_A² = E[A_k²] − E²[A_k]
     = E[(Σ_{i=1}^{m} a_{ik})(Σ_{j=1}^{m} a_{jk})] − μ_A²
     = Σ_{i=1}^{m} Σ_{j=1}^{m} E[a_{ik} a_{jk}] − μ_A²
     = m(m − 1) N_close/N_T + m N_i/N_T − m² N_i²/N_T²,
where expectation is taken across the time index k. The assumptions leading to these equations are equivalent to saying that all of the cells recorded by a single electrode have the same firing rate, and the normalized cross-correlation between any two of the single-unit spike trains recorded by a single electrode has the same value, ρ_close. If we further assume that the correlation coefficient between the spike trains of any two cells recorded on different electrodes has the same value, ρ_distant, and that the average firing rate of all of the cells recorded on both electrodes is the same, we can find that the covariance of the two multiunit spike trains is

C_AB = E[A_k B_k] − E[A_k] E[B_k]
     = E[(Σ_{i=1}^{m} a_{ik})(Σ_{j=1}^{m} b_{jk})] − μ_A μ_B
     = Σ_{i=1}^{m} Σ_{j=1}^{m} E[a_{ik} b_{jk}] − μ_A μ_B
     = m² N_distant/N_T − m² N_i²/N_T².
Figure 1: Analytical paradigm. The analysis considers the cross-correlation between two multiunit spike trains recorded by electrode A (recording cells 1 and 2) and electrode B (recording cells 3 and 4). The analysis considers the situation in which all spike trains recorded on a single electrode have one value of normalized cross-correlation (ρ_close, solid lines) and spike trains recorded on different electrodes have another value of NCC (ρ_distant, dashed lines).
Normalizing by the geometric mean of the variance of the two multiunit spike trains and canceling terms, we find that the normalized cross-correlation (correlation coefficient) function for the multiunit spike trains is

ρ_AB = C_AB / √(σ_A² σ_B²) = m ρ_distant / ((m − 1) ρ_close + 1).
Although it might appear that this expression for ρ_AB could range beyond the interval [−1, 1], this possibility is excluded by the assumptions that the correlation between any two cells recorded on one electrode is the same and that the correlation between any two cells recorded on different electrodes is the same. Moreover, in the usual situation for neurophysiological recordings, with the number of spikes small compared to the total number of time bins, only positive and very small negative values for the cross-correlation are possible. A numerical check of the expression above follows.
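As a quick numerical sanity check on this expression (our sketch, with arbitrary parameter values; the shared-event construction below realizes the special case ρ_close = ρ_distant discussed later with Figure 5):

import numpy as np

def rho_multiunit(m, rho_close, rho_distant):
    """Closed-form multiunit correlation coefficient derived above."""
    return m * rho_distant / ((m - 1) * rho_close + 1)

rng = np.random.default_rng(0)
NT, m = 200_000, 2                      # time bins; cells per electrode
event = rng.random(NT) < 0.02           # common source driving all cells
spikes = lambda: (event & (rng.random(NT) < 0.5)).astype(float)
cells_a = [spikes() for _ in range(m)]  # electrode A
cells_b = [spikes() for _ in range(m)]  # electrode B
A, B = sum(cells_a), sum(cells_b)

corr = lambda x, y: np.corrcoef(x, y)[0, 1]
rho_c = corr(cells_a[0], cells_a[1])    # empirical within-electrode NCC
rho_d = corr(cells_a[0], cells_b[0])    # empirical between-electrode NCC
print(corr(A, B), rho_multiunit(m, rho_c, rho_d))  # the two should agree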
Figure 2: Single- and multiunit peak NCC (normalized cross-correlation). The peak NCC for our population of pairs of single-unit spike trains is shown at the left. At the right, the peak NCC for the corresponding population of multiunit spike trains is shown. The multiunit correlation peaks are much larger.
3 Results

3.1 Experimental. Figure 2 shows the area under the central peak (τ = ±10 msec, where τ is the time-shift variable) of the normalized cross-correlation for our rat somatosensory system data set (Bedenbaugh, 1993; Bedenbaugh et al., 1997). The normalized cross-correlation for the multiunit spike trains is clearly larger than the normalized cross-correlation for the single-unit spike trains. Note that the scale on the ordinate is different for the two graphs; since only one multiunit spike train was obtained from each electrode, the data set included many more single-unit cross-correlograms.

3.2 Analytical. Figure 3 shows that for our model system, the multiunit normalized cross-correlation (ρ_multiunit) is a linear function of the NCC between the single-unit spike trains recorded on different electrodes, ρ_distant. The slope of this relationship is determined by the NCC between spike trains recorded on the same electrode. Counting the correlation between each of the cells recorded on one electrode and each cell recorded on the
Figure 3: Multiunit NCC as a function of distant NCC (m = 2). The multiunit NCC, ρ_multiunit, is proportional to the distant NCC, ρ_distant. The slope is linearly proportional to m, the number of cells recorded on a single electrode, and is inversely related to the close NCC, ρ_close. Note that the minimum slope is 1, when m = 1 and ρ_close = 1.
other electrode effectively amplifies the multiunit NCC. Each spike train has multiple opportunities to contribute to the multiunit NCC. Note that the linear dependence of ρ_multiunit on ρ_distant with zero intercept implies that a difference in the sign of two ρ_multiunit measurements may be unambiguously interpreted as arising from a difference in the sign of ρ_distant. This is also evident in the symmetry about the abscissa of ρ_multiunit as a function of ρ_close, as seen in Figure 4. The special case of ρ_close = ρ_distant = ρ is illustrated in Figure 5. Note that the slope for negative correlation is steeper than for positive correlation, implying that in this circumstance multiunit NCC is a more sensitive detector of negative cross-correlation than of positive cross-correlation. The NCC is plotted as an inset in Figure 5 for the case of two cells recorded on each electrode. The special case of ρ_close = ρ_distant courses a trajectory diagonally across this surface.
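To make the amplification concrete (an illustrative calculation, not from the original): with m = 2 cells per electrode, ρ_close = 0, and ρ_distant = 0.1, the formula gives ρ_multiunit = (2 × 0.1)/((2 − 1) × 0 + 1) = 0.2, twice the underlying between-electrode correlation; raising ρ_close to 0.5 reduces this to 0.2/1.5 ≈ 0.13.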
Figure 4: Multiunit NCC as a function of close NCC (m = 2). The multiunit NCC, ρ_multiunit, is a nonlinear function of the close NCC, ρ_close. Note that in the situation we considered, positive values for ρ_distant lead to positive values for ρ_multiunit, and negative values for ρ_distant lead to negative values for ρ_multiunit.
3.3 Discussion. This study was inspired by the observation that the normalized cross-correlation (correlation coefficient) between multiunit spike trains recorded from the rat somatosensory cortex was significantly larger than the cross-correlation between the single-unit spike trains composing the multiunit recordings. Our analysis suggests why this should be so and suggests that multiunit cross-correlation is an intrinsically more sensitive detector of neuronal relationships than is single-unit cross-correlation.

Two technical points should be noted. First, the analysis here is valid for any bin in a normalized cross-correlation histogram, for both exact coincidences and delayed coincidences. Second, the analysis presented here does not take into account the refractory period of the recording system measuring the multiunit spike trains, which is incapable of resolving simultaneous events on a single electrode. This dead time deletes some nearly simultaneous spikes from different cells from the multiunit spike train, and
Figure 5: Multiunit NCC when all single-unit NCCs are equal. In the special case where the NCC between any two cells is the same, regardless of location (ρ_close = ρ_distant = ρ), ρ_multiunit is a nonlinear function of ρ. Given our assumptions, it is an even more sensitive indicator of negative cross-correlation than of positive cross-correlation. Note, however, that the range of negative correlations that can satisfy the restriction that all recorded cells be equally correlated is small. The inset plots the multiunit correlation as a surface parameterized by ρ_close and ρ_distant for the case of two neurons recorded on each electrode. The curve for m = 2 in the main plot courses a trajectory diagonally across this surface.
so should reduce the sensitivity of the multiunit cross-correlation to correlations between spike trains recorded on a single electrode. This analysis considered the case in which all observed neurons have the same, constant firing rate. If firing rates change slowly, these results should still apply. If some of the neurons within a multiunit cluster fire at different rates, we expect less dependence of the cross-correlation on the number of neurons in a cluster. For this model system, the multiunit NCC is larger than the single-unit cross-correlation between cells recorded on different electrodes. This means
that although the NCC between two multiunit spike trains is related to the NCC between the individual spike trains recorded by the two electrodes, it is not an unambiguous quantitative measure of that relationship. For example, if we take ρ_close = 0, the multiunit NCC increases linearly with the number of cells comprising the multiunit spike trains, so that a change in recruitment may masquerade as a change in multiunit cross-correlation. More generally, a change in multiunit NCC can also result from either a change in the local NCC between cells recorded on a single electrode or a change in the distant NCC between cells recorded on different electrodes, unless the change results in a sign change. These ambiguities limit the utility of comparisons among individual multiunit cross-correlograms. There is a place for comparing populations of multiunit cross-correlograms across treatment groups, though possible changes in the typical number of cells recorded due to the treatment would remain a concern. On the other hand, multiunit NCC is a more sensitive indicator of a relationship between spike trains than is single-unit NCC, and if the recordings are stable and stationary, a change in multiunit cross-correlation implies some underlying change in the cross-correlation between the single-unit spike trains. Multiunit NCC may be especially useful for detecting negative cross-correlations, which have been observed more rarely than have positive cross-correlations. The ambiguity with respect to the underlying correlation relationship and the difficulty of assessing the stationarity (uniform recruitment) and stability of multiunit recordings limit the conclusions that can be drawn by comparing multiunit cross-correlation measurements. With appropriate caution, the technical ease with which such measurements can be obtained, and their special sensitivity for detecting cross-correlation, make them useful tools for neuroscientists nonetheless.

Acknowledgments

A preliminary version of this work was presented as a poster at the 1993 annual meeting of the Society for Neuroscience (Bedenbaugh & Gerstein, 1993). It was supported by NIDCD F32-DC00144 and RO1-DC01249 and by NIMH MH R37-467428. Thanks to Michael Brecht and Heather Read for reading the manuscript. Thanks to Hagai Attias for helpful discussions. Portions of this work were performed in the laboratory of Michael M. Merzenich.

References

Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61(5), 900–917.
Bedenbaugh, P. H. (1993). Plasticity in the rat somatosensory cortex induced
by local microstimulation and theoretical investigations of information flow through neurons. Unpublished doctoral dissertation, University of Pennsylvania, Philadelphia.
Bedenbaugh, P. H., & Gerstein, G. L. (1993). Multiunit normalized cross correlation differs from the average single unit normalized correlation in rat somatosensory cortex. Society for Neuroscience Abstracts, 19(2), 1566.
Bedenbaugh, P. H., Gerstein, G. L., & Bolton, M. (1997). Intracortical microstimulation in rat somatosensory cortex: Receptive field plasticity and somatic stimulus dependent spike train correlations. Unpublished manuscript.
Gerstein, G. L., & Aertsen, A. M. H. J. (1985). Representation of cooperative firing activity among simultaneously recorded neurons. Journal of Neurophysiology, 54(6), 1513–1528.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36(1), 4–14.
Gerstein, G. L., Bloom, M. J., Espinosa, I. E., Evanczuk, S., & Turner, M. R. (1983). Design of a laboratory for multineuron studies. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5), 668–676.
Gerstein, G. L., & Perkel, D. H. (1972). Mutual temporal relationships among neuronal spike trains. Biophysical Journal, 12, 453–473.
Gerstein, G. L., Perkel, D. H., & Dayhoff, J. E. (1985). Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5(4), 881–889.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Kreiter, A. K., & Singer, W. (1995). Dynamic patterns of synchronization between neurons in area MT of awake macaque monkeys. Society for Neuroscience Abstracts, 21(2), 905.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biological Cybernetics, 59, 1–11.
Perkel, D. H. (1970). Spike trains as carriers of information. In F. O. Schmitt (Ed.), The neurosciences: A second study program (pp. 587–596). New York: Rockefeller University Press.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967a). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7(4), 419–440.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967b). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophysical Journal, 7(4), 391–418.
Rosenberg, J. R., Amjad, A. M., Breeze, P., Brillinger, D. R., & Halliday, D. M. (1989). The Fourier approach to the identification of functional coupling between neuronal spike trains. Progress in Biophysics and Molecular Biology, 53, 1–31.
Received May 14, 1996; accepted November 12, 1996.
Communicated by Garrison Cottrell
A Simple Common Contexts Explanation for the Development of Abstract Letter Identities

Thad A. Polk
Department of Psychology, University of Michigan, Ann Arbor, MI 48109, U.S.A.
Martha J. Farah Department of Psychology, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
Abstract letter identities (ALIs) are an early representation in visual word recognition that are specific to written language. They do not reflect visual or phonological features, but rather encode the identities of letters independent of case, font, sound, and so forth. How could the visual system come to develop such a representation? We propose that because many letters look similar regardless of case, font, and other characteristics, these provide common contexts for visually dissimilar uppercase and lowercase forms of other letters (e.g., e between k and y in key and E in the visually similar context K-Y). Assuming that the distribution of words’ relative frequencies is comparable in upper- and lowercase (that just as key is more frequent than pew, KEY is more frequent than PEW), these common contexts will also be similarly distributed in the two cases. We show how this statistical regularity could lead Hebbian learning to produce ALIs in a competitive architecture. We present a self-organizing artificial neural network that illustrates this idea and produces ALIs when presented with the most frequent words from a beginning reading corpus, as well as with artificial input. 1 Introduction People can read text written in many different ways: in UPPERCASE, lowercase, or even AlTeRnAtInG cAsE; in different sizes and in different fonts. In many cases, these different forms look very different (e.g., compare are and ARE) and yet we effortlessly identify them as the same word. How do we do it? One popular hypothesis is that an early stage in reading involves the computation of abstract letter identities (ALIs), that is, a representation of letters that denotes their identity but abstracts away from their visual appearance (uppercase versus lowercase, font, size, etc.) (Coltheart, 1981; Besner, Coltheart, & Davelaar, 1984; Bigsby, 1988; Mozer, 1989; Prinzmetal, Hoffman, & Vest, 1991), and a number of empirical studies have provided support for this hypothesis. For example, even when subjects are asked to classify pairs Neural Computation 9, 1277–1289 (1997)
© 1997 Massachusetts Institute of Technology
of letter strings as same or different based purely on physical criteria, letter strings that differ in case but share the same letter identities (e.g., HILE/hile) are distinguished less efficiently than are strings with different spellings but the same phonological code (e.g., HILE/hyle) (Besner et al., 1984). This result suggests that subjects do compute ALIs (identical ALIs [HILE/hile] impair performance), and this computation cannot be turned off and is thus architectural (the task is based on physical equivalence, not ALI equivalence, and yet ALIs affected performance). Also, just as subjects tend to underestimate the number of letters in a string of identical letters (e.g., DDDD) compared with a heterogeneous string (e.g., GBUF), subjects also tend to underestimate the total number of Aa’s and Ee’s in a display more often when uppercase and lowercase instances of a single target appear (e.g., Aa) than when one of each target appears (e.g., Ae; Mozer, 1989). The repetition of ALIs must be responsible for this effect because visual forms were not repeated in the mixed-case displays (e.g., uppercase A and lowercase a look different). Notice that these effects cannot be due simply to the visual similarity of different visual forms for a given letter because they also show up for letters whose visual forms are not visually similar (e.g., A/a, D/d, E/e). It appears that this representation is not a phonological code but is computed earlier by the visual system itself. For example, Coltheart (1981) reported a brain-damaged patient who had lost the ability to produce or distinguish the sounds of words and letters, but was able to retrieve the meanings of written words and to determine whether nonwords that differed in case were composed of the same letters (ANER/aner versus ANER/aneg). This result suggests that the representation of ALIs does not depend on letter phonology because access to ALIs was preserved even when letter phonology was severely impaired. Similarly, Reuter-Lorenz and Brunn (1990) described a patient with damage to the left extrastriate visual cortex who was impaired at extracting letter identities and was thus unable to read. In a functional magnetic resonance imaging study, Polk et al. (1996) found an area in left inferior extrastriate visual cortex that responded significantly more to alternating case words and pseudowords than to consonant strings. A natural interpretation of this result is that the brain area is sensitive to abstract, rather than visual, letter and word forms, because the stimuli were visually unfamiliar. Finally, Bigsby (1988) presented empirical evidence in normal subjects suggesting that ALIs are computed prior to phonological or lexical access. The evidence, then, is that a relatively early representation in the visual processing of orthography (before lexical access or the representation of phonology) is specific to our writing system and does not reflect fundamental visual properties such as shape. Different stimuli that have little or no visual similarity (e.g., A and a) are represented with similar codes. What would lead visual areas to develop such a representation?
2 The Common Contexts Hypothesis

We propose that the visual contexts in which different forms of the same letter appear interact with correlation-based Hebbian learning in the visual system of the brain to produce ALIs. Specifically, because many letters look similar regardless of case, font, and so forth, we assume that different visual forms of the same letter share similar distributions of visual contexts and that this correlation leads the visual system to produce representations corresponding to ALIs. People are exposed to any given word written in a variety of different ways: all capitals, lowercase, in different fonts, in different sizes, and so on. Because the shapes of some letters are relatively invariant over these transformations, they tend to make the contexts in which one visual form of a letter appears similar to the contexts in which other visual forms of that same letter appear. For example, if the visual form a occurs in certain contexts (e.g., between c and p in cap, before s in as), then the visual form A will also occur in visually similar contexts (between C and P in CAP, before S in AS). Of course, the different visual forms of some letters are fairly different (e.g., D versus d), and so there will be some contexts that are unique to one visual form of a letter (e.g., a but not A occurs in the context d-d). But given that a number of letters are similar in their uppercase and lowercase forms (Cc, Kk, Oo, Pp, Ss, Uu, Vv, Ww, Xx, Zz, and to some degree Bb, Ff, Hh, Ii, Jj, Ll, Mm, Nn, Tt, and Yy; the obvious exceptions are Aa, Dd, Ee, Gg, Qq, and Rr) and that letters in different fonts and sizes tend to look similar, there will be a significant proportion of common contexts for dissimilar-looking uppercase and lowercase letter forms.
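The toy sketch below (ours, not from the original article) makes the context statistics concrete: it counts the (left, right) visual contexts in which a letter form appears, collapsing case-similar letters onto a single form. The SIMILAR set, the word list, and all names are illustrative assumptions. Even on this tiny vocabulary, a and A come out with identical context counts, while visually distinct letters would not:

```python
# Toy count of "common visual contexts." Letters whose upper- and lowercase
# shapes are assumed similar are collapsed onto one visual form; word
# boundaries are marked with "_".
SIMILAR = set("ckopsuvwxz")  # assumed case-similar letters (illustrative)

def visual_form(ch):
    return ch.lower() if ch.lower() in SIMILAR else ch

def contexts(words, target):
    """Count the (left form, right form) contexts in which `target` appears."""
    counts = {}
    for w in words:
        padded = f"_{w}_"
        for i in range(1, len(padded) - 1):
            if padded[i] == target:
                key = (visual_form(padded[i - 1]), visual_form(padded[i + 1]))
                counts[key] = counts.get(key, 0) + 1
    return counts

words = ["cap", "CAP", "as", "AS", "dog", "DOG"]
print(contexts(words, "a"))  # {('c', 'p'): 1, ('_', 's'): 1}
print(contexts(words, "A"))  # identical context counts, despite a and A looking different
```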
3 A Neural Network Model Why would common contexts lead the visual system to produce representations corresponding to ALIs? Figure 1 presents a simple and natural mechanistic model that demonstrates one possibility. The model is a two-layer neural network that uses a Hebbian learning rule to modify the weights of the connections between the input and output layers. Hebbian learning is a neurophysiologically plausible mechanism that generally corresponds to the following rule: if two units are both firing, then their connection is strengthened; if only one unit of a pair is firing, then their connection is weakened (Hebb, 1949). The input layer represents the visual forms of input letters using a localist representation (each unit represents a visual form, and similar visual forms, such as C and c, are represented by the same unit). This representation does not code letter position because it does not play a role in our explanation. Initially, the output layer does not represent anything (since the connections from the input layer are initially random), but with training it should self-organize to represent ALIs. Each unit in the
Figure 1: A neural network model of the development of abstract letter identities. The input layer (top) represents the visual forms of input letters but does not code their position. Letters whose uppercase and lowercase forms are visually similar are represented by a single unit (e.g., C, P, S). In this example, the word cap is presented. Initially, the output representation is random (left), but eventually a cluster of activity develops (right), and Hebbian learning strengthens the connections to it from the input letters.
output layer is connected to its eight surrounding neighbors by excitatory connections (except for units on the edges of the output layer; there is no wraparound), and units farther away are connected by inhibitory connections, in keeping with previous models of cortical self-organization (von der Malsburg, 1973). As the last simulation will demonstrate, the critical assumption is that there is competition in the output layer (implemented here by inhibitory connections); the excitatory connections are not necessary. Figure 1 illustrates the model's behavior when the first word (say, cap) is initially presented. The pattern of output activity is initially random (see Figure 1, left), reflecting the random initial connection strengths. The short-range excitatory connections lead to clusters around the most active units (see Figure 1, middle), and these in turn drive down activity elsewhere via the longer-range inhibitory connections, leading to a single cluster (or a small number of clusters) (see Figure 1, right). The Hebbian rule then strengthens the connections to this active cluster from the active input units (the letters c, a, and p), but weakens the connections from other (inactive) inputs as well as the connections from the active inputs to the inactive output units.
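The rule itself can be stated in a few lines. The following is our own minimal rendering of the strengthen-if-both-fire, weaken-if-one-fires rule for binary activities; the learning rate, array sizes, and function name are illustrative assumptions, and the lateral competition described above is omitted:

```python
import numpy as np

def simple_hebb(W, pre, post, lr=0.05):
    """Hebbian update for 0/1 activity vectors `pre` (inputs) and `post` (outputs):
    strengthen a connection when both units fire, weaken it when only one does."""
    both = np.outer(pre, post)                                    # both units active
    only_one = np.outer(pre, 1 - post) + np.outer(1 - pre, post)  # exactly one active
    return W + lr * both - lr * only_one

# Example: three active letter inputs, one active output unit in the winning cluster.
W = np.zeros((26, 42))
pre = np.zeros(26)
pre[[2, 0, 15]] = 1            # c, a, p (indices illustrative)
post = np.zeros(42)
post[10] = 1                   # a unit in the winning cluster
W = simple_hebb(W, pre, post)
```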
Because of these weight changes, c, a, and p will subsequently be biased toward activating units in that cluster, while other inputs (e.g., d) will be biased away from that cluster. So, for example, if the word dog was presented next, it would tend to excite units outside the cluster excited by cap. Now suppose we present the same word, but in uppercase (CAP; see Figure 2). Because C and P are visually similar to c and p, their input representations will also be similar (in this simple localist model, that means they excite the same units; in a more realistic distributed model, the representations would share many units rather than being identical). As a result, C and P will be biased toward exciting some of the same output units that cap excited. The input A, however, has no such bias. Indeed, its connections to the cap cluster would have been weakened (because it was inactive when the cluster was previously active), and it excites units outside this cluster (top of the output layer at the left of Figure 2). The cluster inhibits these units via the long-range inhibitory connections, and it eventually wins out (right of Figure 2). Hebbian learning again strengthens the connections from the active inputs (including A) to the cluster. The result is that a and A are biased toward exciting nearby units despite the fact that they are visually dissimilar and initially excited quite different units. An ALI has emerged. If this were the whole story, then one might expect all letters to converge on the same output representation because many different letters occasionally occur in common contexts (e.g., a and o in c-t). The reason this does not happen is that the distributions of contexts in which different letters appear are very different. For most contexts in which a occurs, there is a visually similar context in which A occurs (in the same word written in uppercase). Furthermore, these contexts will occur with comparable relative frequencies because words that are frequent in lowercase will also tend to be frequent in uppercase.¹ So if a frequently occurs in a given context (e.g., says), then the corresponding visually similar context will also be frequent for A (e.g., SAYS). And infrequent contexts for a will be infrequent for A (zap and ZAP). As a result, the same forces that shape the representation of a (that it looks more like s than like z) will shape the representation of A, and the two inputs will end up having similar output representations. The same is not true for different letters. Although a and o occur in some of the same contexts (cat and cot), there are many contexts in which only one of the two letters can appear. Furthermore, even contexts that admit either letter will not occur with comparable relative frequencies (cat is roughly fifty times more frequent than cot in beginning reading books). Thus, very

¹ The claim is that these contexts appear with comparable relative, not absolute, frequencies. Lowercase contexts are more frequent than uppercase contexts in most corpora, but the relative frequencies are similar. So just as says is more frequent than zap, SAYS is more frequent than ZAP, and it does not matter much to the model whether SAYS is more or less frequent than zap.
Figure 2: The behavior of the network from Figure 1 when presented with the word CAP. C and P excite the previous cluster because of their visual similarity to the c and p in cap, but at first A excites a distinct set of units (top of the output layer). These two regions of activity compete until the previous cluster wins out. Hebbian learning then strengthens the connections from all the active inputs (including A) to the cluster, biasing a and A toward exciting nearby units.
different forces will shape the representation of these letters, and they will end up with dissimilar output representations. 4 Results Figure 3 shows the results of presenting a network like this with simple stimuli that satisfy the constraints outlined above. The stimuli consisted of a random sequence of thirty-six three-letter words: twelve in uppercase and twenty-four in lowercase (the same twelve words that appeared in uppercase were presented in lowercase twice in the random sequence). Each word contained one letter from a set of three ALI candidate letters—each of these letters had two possible visual forms (e.g., uppercase and lowercase)— and each such letter appeared in four of the twelve words. The other two letters were randomly chosen from a set of twenty whose visual forms were similar in uppercase and lowercase. Figure 3 shows the output activity when presenting each of the three candidate ALI letters in each of their visual forms after training. In all three cases, the output representations for the two different visual forms are virtually identical; the output representation
Figure 3: The patterns of output activity when presenting two different visual forms for each of three letters to the initial network model after training. In all three cases, the two different visual forms activate similar output representations, that is, ALIs. Note that each of the three letters has a distinct representation.
corresponds to an ALI for the letter. Also note that the output representations of the three different letters were distinct. In fact, these output patterns were unique; no other inputs produced these patterns.

5 Other Simulations

This simple simulation demonstrates the feasibility of the common contexts hypothesis, but it does not demonstrate that the idea would work for more realistic input sets. Accordingly, we developed three other simulations using such sets. The third is the most realistic model of the three, but because the first two led to some important insights incorporated in the third, it is worth describing them briefly. In order to allow for all twenty-six letters, we increased the number of input units from twenty-six to thirty-two (twenty-six lowercase letters and the uppercase forms for the six letters that are most dissimilar in the two forms: A, D, E, G, Q, and R). Once again, all input units were connected to all output units, and connections between neighboring output units were excitatory while those between nonneighboring units were inhibitory. Our first attempt at a more realistic input set was Dr. Seuss's Hop on Pop. After "reading" this book ten times, the simulation
developed similar representations for A and a but not for the other five letters with dissimilar forms. This corpus was too small for upper- and lowercase contexts to be comparably distributed (e.g., many words that occurred frequently in lowercase never appeared in uppercase, and vice versa), however, so it is not surprising that the other letters failed. Indeed, it was the failure of these other letters that suggested the importance of the overall distribution of contexts. Accordingly, we used a more realistic corpus from beginning reading books for our next simulation (Baker & Freebody, 1989). This corpus is composed of the words from a large number of school books that are used in teaching children to read in Australia. The corpus was far too large to train our simulation on in its entirety, so we constructed a training set of 3750 words from the 50 most frequent words, with appropriate relative frequencies, in random order (50 percent of all the words in the corpus came from this set of 50 words). Also, the corpus did not distinguish different visual forms of the same word, so we presented the same number of UPPERCASE, lowercase, and Initial Capital forms for each word.² This simulation consistently developed an ALI for E/e, but not for A/a or D/d, and the other three letters with dissimilar forms (G, Q, R) developed almost no representations at all. Both models demonstrated that the hypothesized interaction between common contexts and correlation-based learning can lead to some ALIs. The reason ALIs did not develop for all the letters is that the simple Hebbian learning rule we were using puts more and more emphasis on frequent letters and less and less on infrequent letters. Whenever an input is inactive, its connections to any active outputs are weakened. As a result, the connections from infrequent letters to output units that are activated with any regularity are constantly reduced in strength. Conversely, the connections from frequent inputs to the outputs they activate are regularly being strengthened. The result is that frequent letters come to dominate the representation of words. This undermines the development of ALIs in two ways. First, frequent letters such as A dominate the output representations of words in which they occur, making the surrounding context letters irrelevant. Thus, if A and a start out with different output representations, as they usually do, the common contexts in which they occur become irrelevant because A and a dominate the representations rather than the context letters. Second, whatever ALIs may develop for infrequent letters quickly get squashed by the weakening of their connection strengths when those letters are not present (and they normally are not present). This problem with the simple Hebbian learning rule we used has led previous researchers to reject it in favor of more sophisticated alternatives in
² There were two exceptions. The lowercase forms of proper names still had the first letter capitalized, and the Initial Capital forms of verbs that are not imperatives (which would rarely occur as the first word of a sentence) were presented in lowercase.
(Figure 4 panel labels: A, D, E, G, Q, R; a, d, e, g, q, r.)
Figure 4: The patterns of output activity when presenting two different visual forms for each of six letters to the final network model after training. In most cases, the two different visual forms again activate similar output representations, that is, ALIs.
which weights are normalized in some way so that frequent inputs will not completely dominate infrequent ones. After all, real neural learning mechanisms are quite adept at exploiting input features that are predictive, even if they are very infrequent. In our final simulation, we adopted a zero-sum Hebbian learning rule, based on a number of neurophysiological constraints, that has been shown to have nice convergence properties in both activation space and weight space (O'Reilly & McClelland, 1992; see Miller, 1990, and Rolls, 1989, for similar rules). The appendix gives the details of the model. Figure 4 presents the results of training the revised architecture on the beginning reading corpus using the zero-sum Hebbian learning rule. O'Reilly and McClelland (1992) used this rule in a winner-take-all architecture with inhibitory connections but no excitatory connections within a layer, so we tried it both with and without excitatory connections. The results were similar, so we present only the results without excitatory connections. As Figure 4 shows, the upper- and lowercase forms of five of the six letters of interest converged on similar representations. The representations of R and r were different, but only because the weights from r were smaller than those from R, so only the most active output passed the threshold of the all-or-none style sigmoid activation function that was used. The underlying patterns of weights were similarly distributed; the weights from R were simply stronger. Also, notice that the representations of the infrequent letters (G, Q, and R) are similar to each other. The reason is that the zero-sum Hebbian rule is weight conserving: the positive and negative weight changes sum to zero. As a result, when the weights from infrequent inputs to frequently activated outputs are weakened, the connections from those infrequent inputs to the infrequent outputs are strengthened. Thus, the infrequent outputs are
shared by all the infrequent inputs, and their representations look similar. There are certainly differences among the representations of the infrequent letters, but these results indicate that a substantial part of the representations for these letters is indicating that the letter is infrequent, not that it is a G, Q, or R. In fact, Q and q developed similar representations despite never appearing in the corpus at all. Finally, it is worth pointing out that although the different visual forms for each letter converged on similar representations, the similarity is not as strong as it was for the first simulation, which used artificial input. The reason is that the realistic input set used in this simulation often violated the assumptions of the common contexts hypothesis. Many of the words in the beginning reading corpus are composed of multiple letters that look different in upper- and lowercase (e.g., all three letters in are/ARE look different in upper- and lowercase). These words, some of which are among the most frequent in the corpus, tend to undermine the development of ALIs because they do not provide common visual contexts in their upper- and lowercase forms. Do humans represent A and a with similar codes, or do they compute an identical representation at some point during processing? The empirical evidence reviewed above does not distinguish these alternatives. Like identical codes, similar codes would make it harder to distinguish strings that differed only in case and could lead subjects to underestimate the number of target letters in a display in which upper- and lowercase instances of a single target appear. In any case, the common contexts hypothesis does not take a stand on this issue. The reason is that unsupervised neural networks are exceptionally good at learning to map similar inputs to identical outputs, so it would be straightforward to compute identical representations for different visual forms of individual letters from the outputs of the current model. 6 Discussion It is becoming clear that modality-specific visual processing can become specialized for dealing with written language. ALIs are not the only such representation. For example, in a positron emission tomography study, Petersen, Fox, Snyder, and Raichle (1990) found that certain areas in the left medial extrastriate visual cortex were activated by visually presented words and pseudowords that obey English spelling rules, but not by nonsense strings of letters or letter-like forms. The visual characteristics of these different stimulus types were carefully matched, suggesting that these visual areas had developed a representation that is sensitive to the statistical regularities of real text, not just to low-level visual features that appear in a wide range of stimuli. How could visual representations come to encode abstract, orthographic regularities? In the case of ALIs, we proposed that the statistics of the visual environment (the common contexts surrounding letters in words) interact
with correlation-based learning in the brain to form representations of letter identities that abstract away from differences in physical form (e.g., a, A). A variant of this hypothesis has previously been used to explain the observed neural localization of arbitrary and noninnate categories (e.g., letters and digits) (Polk & Farah, 1995a, 1995b; Farah, Polk, et al., 1996). To demonstrate the feasibility of the common contexts hypothesis as an explanation for the development of ALIs, we showed that simple competitive Hebbian networks spontaneously self-organize to produce ALIs when exposed to orthographic input that preserves a basic statistical feature of real text, namely, that different visual forms of the same letter appear in similarly distributed common contexts. An obvious alternative to the proposed explanation for the development of ALIs is that ALIs emerge from hearing the same sound associated with the different visual forms of a given letter, for example, when children are learning to read. Any supervised learning algorithm, such as backpropagation, could obviously learn to group the different visual forms of a single letter if it was provided with phonological feedback. This explanation is certainly viable, but the common contexts hypothesis does have certain advantages. First, supervised learning does not have the independent physiological plausibility that Hebbian learning has. Second, if phonology plays such a crucial role, one might expect an impairment in letter phonology to undermine any attempts to read. But phonological impairments need not destroy reading (Coltheart, 1981). Thus, one would need to assume that phonology plays a critical role in learning ALIs, but that once they are acquired, they no longer depend on phonology. Third, this hypothesis must assume that phonology, a representation that is relatively far downstream, influences the development of a representation used in a simple visual matching task, which does not even require recognizing that letters are present, much less identifying them and retrieving their names (Besner et al., 1984). The common contexts hypothesis, on the other hand, proposes that ALIs arise bottom up, from an interaction between the statistics of shapes present in the visual input and the nature of cortical learning mechanisms. It therefore accommodates in a natural way the findings that ALIs are independent of phonology and that ALI effects show up in simple visual matching tasks. Furthermore, the hypothesis does not depend on supervised learning, but rather is based on a simple, unsupervised Hebbian mechanism that has independent physiological plausibility. Empirical work is needed to distinguish these hypotheses conclusively, but the common contexts hypothesis does offer a viable and attractive alternative to the phonological account. Appendix: Details of the Final Neural Network Model The network has seventy-four total units. Thirty-two are inputs (6 ALI candidate letters × 2 visual forms each + the 20 other letters) and the other
forty-two are outputs in a 7 × 6 2D arrangement. All input units are connected to all output units with plastic connections. Output units are connected to each other by fixed inhibitory connections (weight = −0.1). The minimum and maximum unit firing rates are fixed at 0.0 and 100.0, and the minimum and maximum connection weights are fixed at 0.0 and 1.0. Initially, the activity of output units is uniform random between 0.0 and 10.0, and the connection weights from inputs to outputs are uniform random between 0.0 and 0.6. The following zero-sum Hebbian learning rule based on firing rate is used after every cycle to update connection strengths between input and output units (for details and motivation, see O'Reilly & McClelland, 1992):

\[
\text{change}_{ij} = 0.01\,\bigl(o_j - \langle o\rangle_{\mathrm{out}}\bigr)\bigl(o_i - \langle o\rangle_{\mathrm{in}}\bigr)
\]
\[
w_{ij} \leftarrow
\begin{cases}
w_{ij} + \text{change}_{ij}\,(1.0 - w_{ij}) & \text{if } \text{change}_{ij} > 0 \\
w_{ij} + \text{change}_{ij}\,(w_{ij} - 0.0) & \text{otherwise,}
\end{cases}
\]

where o_i is the activation of input unit i, o_j is the activation of output unit j, ⟨o⟩_in is the average activation in the input layer, ⟨o⟩_out is the average activation in the output layer, and w_ij is the weight of the connection between unit i and unit j. The output units use a sigmoid transfer function:

\[
\text{output} = \frac{100.0}{1 + e^{-(\text{input} - 40.0)}}.
\]

The total input to each output unit (the weighted sum of the activations of the units to which it is connected) is multiplied by a 0.9 gain factor before passing through the transfer function. The input units are clamped to their values (0.0 when not firing, 100.0 when firing) and do not decay.
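For readers who want to experiment, here is a minimal numpy sketch of one update cycle of the appendix model. This is our code, not the authors'; it follows the parameters above, but for brevity it omits the settling dynamics under the fixed lateral inhibition, and the input indices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_OUT = 32, 42                              # 32 inputs; 7 x 6 = 42 outputs
W = rng.uniform(0.0, 0.6, size=(N_IN, N_OUT))     # plastic input-to-output weights

def transfer(x, W, gain=0.9):
    """Sigmoid transfer function: 100 / (1 + exp(-(net - 40))), with 0.9 input gain."""
    net = gain * (x @ W)
    return 100.0 / (1.0 + np.exp(-(net - 40.0)))

def zero_sum_hebb(W, o_in, o_out, lr=0.01):
    """Zero-sum Hebbian rule: change_ij = 0.01 (o_j - <o>_out)(o_i - <o>_in),
    with positive changes scaled by (1 - w_ij) and negative ones by (w_ij - 0),
    which keeps every weight inside [0, 1]."""
    change = lr * np.outer(o_in - o_in.mean(), o_out - o_out.mean())
    return W + np.where(change > 0.0, change * (1.0 - W), change * (W - 0.0))

# One cycle: clamp a three-letter word (three inputs firing at 100), update weights.
x = np.zeros(N_IN)
x[[2, 0, 15]] = 100.0                             # illustrative indices for c, a, p
W = zero_sum_hebb(W, x, transfer(x, W))
```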
Acknowledgments

This research was supported by a grant-in-aid for training from the McDonnell-Pew program in cognitive neuroscience to T.A.P. and by grants from the NIH, ONR, Alzheimer's Disease Association, and the Research Association of the University of Pennsylvania. We gratefully acknowledge Max Coltheart for helpful comments on this research and article. Requests for reprints should be sent to Thad Polk, Department of Psychology, University of Michigan, 525 E. University, Ann Arbor, Michigan 48109-1109, U.S.A.

References

Baker, C. D., & Freebody, P. (1989). Children's first school books: Introductions to the culture of literacy. New York: Blackwell.
Besner, D., Coltheart, M., & Davelaar, E. (1984). Basic processes in reading: Computation of abstract letter identities. Canadian Journal of Psychology, 38, 126–134.
Bigsby, P. (1988). The visual processor module and normal adult readers. British Journal of Psychology, 79, 455–469.
Coltheart, M. (1981). Disorders of reading and their implications for models of normal reading. Visible Language, 15, 245–286.
Farah, M., Polk, T. A., Stallcup, M., Aguirre, G., Alsop, D., D'Esposito, M., Detre, J., & Zarahn, E. (1996). Localization of a fine-grained category of shape: An extrastriate letter area revealed by fMRI. Society for Neuroscience Abstracts, 291(1).
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Miller, K. D. (1990). Correlation-based models of neural development. In M. Gluck & D. Rumelhart (Eds.), Neuroscience and connectionist theory. Hillsdale, NJ: Erlbaum.
Mozer, M. (1989). Types and tokens in visual letter perception. Journal of Experimental Psychology: Human Perception and Performance, 15, 287–303.
O'Reilly, R. C., & McClelland, J. L. (1992). The self-organization of spatially invariant representations (Tech. Rep. No. PDP.CNS.92.5). Pittsburgh, PA: Carnegie Mellon University.
Petersen, S. E., Fox, P. T., Snyder, A. Z., & Raichle, M. E. (1990). Activation of extrastriate and frontal cortical areas by visual words and word-like stimuli. Science, 249, 1041–1044.
Polk, T. A., & Farah, M. (1995a). Late experience alters vision. Nature, 376, 648–649.
Polk, T. A., & Farah, M. (1995b). Brain localization for arbitrary stimulus categories: A simple account based on Hebbian learning. Proceedings of the National Academy of Sciences USA, 92, 12370–12373.
Polk, T. A., Stallcup, M., Aguirre, G., Alsop, D., D'Esposito, M., Detre, J., Zarahn, E., & Farah, M. (1996). Abstract, not just visual, orthographic knowledge encoded in extrastriate cortex: An fMRI study. Society for Neuroscience Abstracts, 291(2).
Prinzmetal, W., Hoffman, H., & Vest, K. (1991). Automatic processes in word perception: An analysis from illusory conjunctions. Journal of Experimental Psychology: Human Perception and Performance, 17, 902–923.
Reuter-Lorenz, P. A., & Brunn, J. L. (1990). A prelexical basis for letter-by-letter reading: A case study. Cognitive Neuropsychology, 7(1), 1–20.
Rolls, E. T. (1989). Functions of neuronal networks in the hippocampus and neocortex in memory. In J. H. Byrne & W. O. Berry (Eds.), Neural models of plasticity: Experimental and theoretical approaches. San Diego: Academic Press.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.

Received May 10, 1995; accepted November 11, 1996.
Communicated by Graeme Mitchison
A Unifying Objective Function for Topographic Mappings Geoffrey J. Goodhill Sloan Center for Theoretical Neurobiology, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A. and Georgetown Institute for Cognitive and Computational Sciences, Georgetown University Medical Center, Washington, DC 20007, U.S.A.
Terrence J. Sejnowski The Howard Hughes Medical Institute, Salk Institute for Biological Studies, La Jolla, CA 92037, U.S.A. and Department of Biology, University of California San Diego, La Jolla, CA 92037, U.S.A.
Many different algorithms and objective functions for topographic mappings have been proposed. We show that several of these approaches can be seen as particular cases of a more general objective function. Consideration of a very simple mapping problem reveals large differences in the form of the map that each particular case favors. These differences have important consequences for the practical application of topographic mapping methods. 1 Introduction The notion of a topographic mapping that takes nearby points in one space to nearby points in another appears in many domains, both biological and practical. A number of algorithms have been proposed that claim to construct such mappings (reviewed in Goodhill & Sejnowski, 1996). However, a fundamental problem with these claims, or the claim that a particular given mapping is topographic, is that the computational-level meaning of topography has not been formally defined. Given the wide variety of contexts in which topography or degrees of topography are discussed, it is unlikely that an application-independent definition would be generally useful. A more productive goal is to attempt to lay bare the assumptions behind different approaches to topographic mappings, and thus clarify the relationships among them. In this way, the hope is to make it easier to choose the appropriate approach for a particular problem. In this article, we introduce an objective function we call the C measure and show that it has the capacity to unify several different approaches to topographic mapping. These approaches constitute particular instantiations of the functions from which the C measure is constructed. We then consider a simple mapping problem and explicitly optimize different versions of the C measure. It becomes apparent that the “optimally topographic” map can radically change depending on the particular measure employed. Neural Computation 9, 1291–1303 (1997)
© 1997 Massachusetts Institute of Technology
2 The C measure

For the purposes of this article, we consider only bijective (1-1) mappings and a finite number of points in each space. Some mapping problems intrinsically have this form, and many others can be reduced to this by separating out a clustering or vector quantization step from the mapping step. Consider an input space Vin and an output space Vout, each of which contains N points (see Figure 1). Let M be the mapping from points in Vin to points in Vout. We use the word space in a general sense: either or both of Vin and Vout may not have a geometric interpretation. Assume that for each space there is a symmetric similarity function that, for any given pair of points in the space, specifies by a nonnegative scalar value how similar (or dissimilar) they are. Call these functions F for Vin and G for Vout. Then we define a cost function C as follows:

\[
C = \sum_{i=1}^{N} \sum_{j} F(i,j)\, G(M(i), M(j)), \tag{2.1}
\]
where i and j label points in Vin, and M(i) and M(j) are their respective images in Vout. The sum is over all possible pairs of points in Vin. Since we have assumed that M is a bijection, it is therefore invertible, and C can equivalently be written as

\[
C = \sum_{i=1}^{N} \sum_{j} F(M^{-1}(i), M^{-1}(j))\, G(i,j), \tag{2.2}
\]
where i and j label points in Vout, and M⁻¹ is the inverse map. A good mapping is one with a high value of C. However, if one of F or G is given as a dissimilarity function (i.e., increasing with decreasing similarity), then a good mapping has a low value of C. How F and G are defined is problem specific. They could be Euclidean distances in a geometric space, some (possibly nonmonotonic) function of those distances, or they could just be given, in which case it may not be possible to interpret the points as lying in some geometric space. C measures the correlation between the F's and the G's. It is straightforward to show that if a mapping that preserves orderings exists, then maximizing C will find it. This is equivalent to saying that for two vectors of real numbers, their inner product is maximized over all permutations within the two vectors if the elements of the vectors are identically ordered, a proof of which can be found in Hardy, Littlewood, and Pólya (1934, p. 261).

Figure 1: The mapping framework.

3 Measures Related to C

Several objective functions for topographic mapping can be seen as versions of the C measure, given particular instantiations of the F and G functions.

3.1 Metric Multidimensional Scaling. Metric multidimensional scaling (metric MDS) is a technique originally developed in the context of psychology for representing a set of N entities (e.g., subjects in an experiment) by N points in a low- (usually two-) dimensional space. For these entities one has a matrix that gives the numerical dissimilarity between each pair of entities. The aim of metric MDS is to position points representing entities in the low-dimensional space so that the set of distances between each pair of points matches as closely as possible the given set of dissimilarities. The particular objective function optimized is the summed squared deviations of distances from dissimilarities. The original method was presented in Torgerson (1952); for reviews, see Shepard (1980) and Young (1987). In terms of the framework presented earlier, the MDS dissimilarity matrix is F. Note that there may not be a geometric space of any dimensionality for which these dissimilarities can be represented by distances (for instance, if the dissimilarities do not satisfy the triangle inequality), in which case Vin does not have a geometric interpretation. Vout is the low-dimensional, continuous space in which the points representing entities are positioned, and G is Euclidean distance in Vout. Metric MDS selects the mapping M by adjusting the positions of points in Vout, which minimizes

\[
\sum_{i=1}^{N} \sum_{j} \bigl(F(i,j) - G(M(i), M(j))\bigr)^{2}. \tag{3.1}
\]
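Before examining further instantiations, here is a small Python sketch (ours, not the authors' code; the toy problem, the schedule constants, and the convention of summing over pairs i < j are illustrative assumptions). It evaluates C for a candidate bijection and improves the map by the pairwise-interchange simulated annealing used in section 4:

```python
import numpy as np

rng = np.random.default_rng(0)

def C(F, G, M):
    """C measure (equation 2.1): sum over pairs i < j of F(i,j) * G(M(i), M(j))."""
    return float(np.sum(np.triu(F * G[np.ix_(M, M)], k=1)))

def anneal(F, G, steps=20000, T=1.0, cool=0.999):
    """Maximize C over bijections by annealed pairwise interchanges (cf. section 4)."""
    N = len(F)
    M = rng.permutation(N)
    cur = C(F, G, M)
    for _ in range(steps):
        i, j = rng.choice(N, size=2, replace=False)
        M[i], M[j] = M[j], M[i]                    # propose swapping two assignments
        new = C(F, G, M)
        if new >= cur or rng.random() < np.exp((new - cur) / T):
            cur = new                              # accept the move
        else:
            M[i], M[j] = M[j], M[i]                # reject: undo the swap
        T *= cool
    return M, cur

# Toy problem: a 3 x 3 grid mapped onto a line of 9 points. With both F and G
# given as dissimilarities, aligned orderings maximize C (the rearrangement
# argument cited above), so high C here means a topographic map.
pts = np.array([(x, y) for y in range(3) for x in range(3)], dtype=float)
F = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)   # squared grid distances
line = np.arange(9, dtype=float)
G = np.abs(line[:, None] - line[None, :])                # distances along the line
M, score = anneal(F, G)
```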
Under our assumptions, this objective function is identical to the C measure. Expanding out the square in equation 3.1 gives

\[
\sum_{i=1}^{N} \sum_{j} \bigl(F(i,j)^{2} + G(M(i), M(j))^{2} - 2\,F(i,j)\,G(M(i), M(j))\bigr). \tag{3.2}
\]
The last term is twice the C measure. The entries in F are fixed, so the first term is independent of the mapping. In metric MDS, the sum over the G's varies as the entities are moved in the output space. If one instead considers the case where the positions of the points in Vout are fixed, and the problem is to find the assignment of entities to positions that minimizes equation 3.1, then the sum over the G's is also independent of the map. In this case the metric MDS measure becomes exactly equivalent to C.

3.2 Minimal Wiring. In minimal wiring (Mitchison & Durbin, 1986; Durbin & Mitchison, 1990), a good mapping is defined as one that maps points that are nearest neighbors in Vin as close as possible in Vout, where closeness in Vout is measured by, for instance, Euclidean distance raised to some power. The motivation here is the idea that it is often useful in sensory processing to perform computations that are local in some space of input features Vin. To do this in Vout (e.g., the cortex) the images of neighboring points in Vin need to be connected; the dissimilarity function in Vout is intended to capture the cost of the wire (e.g., axons) required to do this. Minimal wiring is equivalent to the C measure for

\[
F(i,j) = \begin{cases} 1 & i, j \text{ neighboring} \\ 0 & \text{otherwise} \end{cases}
\qquad
G(M(i), M(j)) = \|M(i) - M(j)\|^{p}.
\]

For the cases of one- or two-dimensional square arrays investigated in Mitchison and Durbin (1986) and Durbin and Mitchison (1990), neighbors are taken to be just the two or four adjacent points in the array, respectively.

3.3 Minimal Path Length. In this scheme, a good map is one such that, in moving between nearest neighbors in Vout, one moves the least possible distance in Vin. This is, for instance, the mapping required to solve the traveling salesman problem (TSP), where Vin is the distribution of cities and Vout is the one-dimensional tour. This goal is implemented by the elastic net algorithm (Durbin & Willshaw, 1987; Durbin & Mitchison, 1990; Goodhill & Willshaw, 1990), which measures dissimilarity in Vin by squared distances:

\[
F(i,j) = \|v_i - v_j\|^{2},
\qquad
G(p,q) = \begin{cases} 1 & p, q \text{ neighboring} \\ 0 & \text{otherwise}, \end{cases}
\]

where v_k is the position of point k in Vin (we have only considered here the regularization term in the elastic net energy function, which also includes a term matching input points to output points).
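As an illustration of how these two instantiations differ only in which space carries the neighborhood structure, the hypothetical snippet below writes them as plain functions for the grid-to-line example of section 4 (our code; the names and grid size are assumptions). Either pair can be tabulated into matrices and passed to a routine such as the C function sketched earlier:

```python
import numpy as np

grid = np.array([(x, y) for y in range(10) for x in range(10)], dtype=float)  # V_in
# V_out is the line 0..99; p and q below are positions on it.

def F_wiring(i, j):
    """1 for nearest neighbors in the grid, 0 otherwise."""
    return 1.0 if np.abs(grid[i] - grid[j]).sum() == 1.0 else 0.0

def G_wiring(p, q, power=1):
    """Distance along the line raised to a power (a dissimilarity, so minimize C)."""
    return float(abs(p - q)) ** power

def F_path(i, j):
    """Squared Euclidean distance in the grid (a dissimilarity)."""
    return float(((grid[i] - grid[j]) ** 2).sum())

def G_path(p, q):
    """1 for nearest neighbors on the line, 0 otherwise."""
    return 1.0 if abs(p - q) == 1 else 0.0
```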
Thus, minimal wiring and minimal path length are symmetrical cases under equation 2.1, with respect to exchange of Vin and Vout. Their relationship is discussed further in Durbin and Mitchison (1990), where the abilities of minimal wiring and minimal path length are compared with regard to reproducing the structure of the map of orientation selectivity in the primary visual cortex (see also Mitchison, 1995).

3.4 The Approach of Jones, Van Sluyters, and Murphy (1991). Jones, Van Sluyters, and Murphy (1991) investigated the effect of the shape of the cortex (Vout) relative to the lateral geniculate nuclei (Vin) on the overall pattern of ocular dominance columns in the cat and monkey, using an optimization approach. They desired to keep both neighboring cells in each LGN (as defined by a hexagonal array), and anatomically corresponding cells between the two LGNs, nearby in the cortex (also a hexagonal array). Their formulation of this problem can be expressed as a maximization of C when

\[
F(i,j) = \begin{cases} 1 & i, j \text{ neighboring or corresponding} \\ 0 & \text{otherwise} \end{cases}
\]

and

\[
G(p,q) = \begin{cases} 1 & p, q \text{ first or second nearest neighbors} \\ 0 & \text{otherwise}. \end{cases}
\]
For two-dimensional Vin and Vout they found a solution such that if F(i,j) = 1, then G(M(i), M(j)) = 1, for all i, j. Alternatively, this problem could be expressed as a minimization of C when G(p,q) is the stepping distance (see below) between positions in the Vout array. They found this gave appropriate behavior for the problem addressed.

3.5 Minimal Distortion. Luttrell and Mitchison have introduced mapping functionals for the continuous case. Under appropriate assumptions to reduce them to the discrete case, these are equivalent to C. Luttrell (1990, 1994) defined a minimal distortion principle that can be interpreted as a measure of mapping quality. He defined "distortion" D as

\[
D = \int dx\, P(x)\, d\{x, x'[y(x)]\}.
\]

x and y are vectors in the input and output spaces, respectively. y is the map in the input to output direction, x′ is the (in general different) map back again, and P(x) is the probability of occurrence of x. x′ and y are suitably adjusted to minimize D. An augmented version of D includes additive noise in the output space,

\[
D = \int dx\, P(x) \int dn\, \pi(n)\, d\{x, x'[y(x) + n]\},
\]

where π(n) is the probability density of the noise vector n. Intuitively, the aim is now to find the forward and backward maps so that the reconstructed value of the input vector is as close as possible to its original value after being corrupted by noise in the output space. In Luttrell (1990), the d function was taken to be {x − x′[y(x) + n]}². In this case, Luttrell showed that the minimal distortion measure can be differentiated to produce a learning rule that is almost the self-organizing map rule of Kohonen (1982). For the discrete version of this, it is necessary to assume that appropriate quantization has occurred so that y(x) defines a 1-1 map, and x′(y) defines the same map in the opposite direction. In this case the minimal distortion measure becomes equivalent to the C measure, with F(i,j) the squared Euclidean distance between the positions of vectors i and j (assuming these to lie in a geometric space) and G(p,q) the noise process in the output space (generally assumed gaussian). The minimal distortion principle was generalized in Mitchison (1995) by allowing the noise process and reconstruction error to take arbitrary forms. For instance, they can be reversed, so that F is a gaussian and G is Euclidean distance. In this case, the measure can be interpreted as a version of the minimal wiring principle, establishing a connection between minimal distortion (and hence the SOM algorithm) and minimal wiring. This identification also yields a self-organizing algorithm similar to the SOM for solving minimal wiring problems, the properties of which are briefly explored in Mitchison (1995). Luttrell (1994) generalized minimal distortion to allow a probabilistic match between points in the two spaces, which in the discrete case can be expressed as

\[
D = \sum_{i} \sum_{j} \sum_{k} F(i,j)\, P(k \mid i)\, P(j \mid k),
\]
where D is the distortion to be minimized, F is Euclidean distance in the input space, P(k|i) is the probability that output state k occurs given that input state i occurs, and P(j|k) is the corresponding Bayes inverse probability that input state j occurs given that output state k occurs. This reduces to C in the special case that P(k|i) = P(i|k), which is true for bijections.

3.6 Dynamic Link Matching. Bienenstock and von der Malsburg (1987a, 1987b) considered position-invariant pattern recognition as a problem of graph matching. Their formalism consists of two sets of nodes, one with links expressing neighborhood relationships given by F(i,j) and one with analogous links given by G(p,q). The aim is to find the bijective mapping between the two sets of nodes that optimally preserves the link structure.
By approximating the Hamiltonian for a particular type of spin neural network, they derived an objective function H that in our notation (taking p = M(i), q = M(j)) can be written as

\[
H = -\sum_{i,j=1}^{N} \sum_{p,q=1}^{N} F(i,j)\, G(p,q)\, w_{ip} w_{jq}
+ \gamma \left[ \sum_{i=1}^{N} \Bigl( \sum_{p=1}^{N} w_{ip} - a \Bigr)^{2} + \sum_{p=1}^{N} \Bigl( \sum_{i=1}^{N} w_{ip} - a \Bigr)^{2} \right]. \tag{3.3}
\]
Here γ and a are constants, and wip is the strength of connection between node i and node p. The first term is a generalized version of the C measure, which allows a weighted match between all pairs of nodes i and p. The second term enforces the regularization constraint that the sum of the weights from or to each node is equal to a. When a = 1 and γ is large, the minima of H are clearly the same as the maxima of C. More recent work has investigated its potential as a general objective function for topographic mappings (Wiskott & Sejnowski, 1997). 4 Application to a Simple Mapping Problem To show the range of optimal maps that the C measure implies for different F and G, we consider the very simple case of mapping between a regular 10 × 10 square array of points (Vin ) and a regular 1 × 100 linear array of points (Vout ). This is a paradigmatic example of dimension reduction (also used in Durbin & Mitchison, 1990 and Mitchison, 1995). It has the virtue that solutions are easily represented (see Figures 2 and 3) and easily compared with intuition. Calculation of the global optimum for this problem by exhaustive search is clearly impractical, since there are of the order of 100! possible mappings. Instead we employ simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983) to find a solution close to optimal. The parameters used are given in the legend to Figure 2. Figure 2 shows the best maps obtained by this method for the minimal path, minimal wiring, and metric MDS versions of the C measure. Also shown for comparison is a solution to this problem produced by the elastic net algorithm, taken from Durbin & Mitchison (1990). An optimal map for the minimal path version (A) is one with no diagonal segments. Our simulated annealing algorithm found one with three diagonal segments and a cost of 100.243, 1.3 percent larger than the optimal of 99.0. The optimal map for the minimal wiring measure has been explicitly calculated (Mitchison & Durbin, 1986), and has a cost of 914.0. Our algorithm found the map shown in (B), which is similar in shape and has a cost of 917.0 (0.3 percent larger than the optimal value). These results confirm that the simulated annealing algorithm employed can find close to optimal solutions. The cost for the
Figure 2: Maps found by optimizing using simulated annealing. (A) Minimal path length solution. (B) Minimal wiring solution. (C) Metric MDS solution. (D) Map found by the elastic net algorithm (see Durbin & Mitchison, 1990). Parameters used for simulated annealing were as follows (for further explanation, see van Laarhoven & Aarts, 1987). Initial map: random. Move set: pairwise interchanges. Initial temperature: 3 × the mean energy difference over 10,000 moves. Cooling schedule: exponential. Cooling rate: 0.998. Acceptance criterion: 1000 moves at each temperature. Upper bound: 10,000 moves at each temperature. Stopping criterion: zero acceptances at upper bound.
metric MDS map (C) is 6,352,324. The illusion of multiple ends to the line is due to the map frequently doubling back on itself. For instance, in the fourth row up the square, the horizontal component for neighboring points in the line progresses in the order 5, 6, 4, 7, 3, 8, 2, 9, 10. Local discontinuity arises because this measure takes into account neighborhood preservation at all scales: local continuity is not privileged over global continuity. The costs of the map produced by the elastic net (D) are 102.3 for the minimal
Figure 3: Minimal distortion solutions. (A) σ = 1.0, cost = 43.3. (B) σ = 2.0, cost = 214.7. (C) σ = 4.0, cost = 833.2. (D) σ = 20.0, cost = 18467.1. Simulated annealing parameters as in Figure 2.
path function, 1140 for the minimal wiring function, and 6,455,352 for the metric MDS function. In each case this is a higher cost than for the maps shown in (A), (B), and (C), respectively. Figure 3 shows close-to-optimal maps for the minimal distortion version of the C measure (F is squared Euclidean distance in the square, G is given by \(e^{-d^{2}/\sigma^{2}}\), where d is Euclidean distance along the line), for varying σ. For small σ, the solution resembles the minimal path optimum of Figure 2A, since the contribution from more distant neighbors on the line compared to nearest neighbors is negligible. However, as σ increases, the map changes form. Local continuity becomes less important compared to continuity at the scale of σ, the map becomes spikier, and the number of large-scale folds in the map gradually decreases until at σ = 20 there is just one.
This last map also shows some of the frequent doubling back behavior seen in Figure 2C.¹

5 Discussion

5.1 Types of Mappings. Several qualitatively different types of map are apparent in Figures 2 and 3. For the metric MDS measure, Figure 2C, the line progresses through the square by a series of locally discontinuous jumps along one axis of the square and a comparatively smooth progression in the orthogonal direction. One consequence is that nearby points in the square never map too far apart on the line. For the minimal distortion measure with σ = 20 (see Figure 3D), the strategy is similar except that there is now one fold to the map. This decreases the size of local jumps in following the line through the square, at the cost of introducing a seam across which nearby points in the square map a very long way apart on the line. For the minimal distortion measure with small σ (see Figures 3A–C), the strategy is to introduce several folds. This gives strong local continuity, at the cost that now many points that are nearby in the square map to points that are far apart on the line. What do these maps suggest about which measures are most appropriate for different problems? If it is desired that generally nearby points should always map to generally nearby points as much as possible in both directions, and one is not concerned about very local continuity, then something like the MDS measure is useful. This may be appropriate for some data visualization applications where the overall structure of the map is more important than the fine detail. If, on the other hand, one desires a smooth progression through the output space to imply a smooth progression through the input space, one should choose something like minimal path or minimal distortion for small σ. This may be important for data visualization where it is believed the data actually lie on a lower-dimensional manifold in the high-dimensional space. However, an important weakness of this representation is that some neighborhood relationships between points in the input space may be completely lost in the resulting representation. For understanding the structure of cortical mappings, self-organizing algorithms that optimize objectives of this type have proved useful (Durbin & Mitchison, 1990; Obermayer, Blasdel, & Schulten, 1992). However, very few other objectives have been applied to this problem, so it is still an open question which are most appropriate. The type of map seen in Figure 3D has not been much investigated. There may be some applications for which such maps are worthwhile, perhaps in a neurobiological context for understanding why different input variables are sometimes mapped into different areas rather than interdigitated in the same area.
¹ We have also explicitly optimized the objective functions proposed by Sammon (1969) and Bezdek and Pal (1995). The former produces a map similar to that of Figure 2C, while the latter resembles Figure 3D (for more details, see Goodhill & Sejnowski, 1996).
5.2 Relation of C to Quadratic Assignment Problems. Formulating neighborhood preservation in terms of the C measure sets it within the well-studied class of quadratic assignment problems (QAPs). These occur in many different practical contexts and take the form of finding the minimal or maximal value of an equation similar to C (see Burkard, 1984, for a general review and Lawler, 1963, and Finke, Burkard, & Rendl, 1987, for more technical discussions). An illustrative example is the optimal design of typewriter keyboards (Burkard, 1984). If F(i,j) is the average time it takes a typist to sequentially press locations i and j on the keyboard, while G(p,q) is the average frequency with which letters p and q appear sequentially in the text of a given language (note that in this example F and G are not necessarily symmetrical), then the keyboard that minimizes average typing time will be the one that minimizes the product

\[
\sum_{i=1}^{N} \sum_{j=1}^{N} F(i,j)\, G(M(i), M(j))
\]
(see equation 2.1), where M(i) is the letter that maps to location i. The substantial amount of theory developed for QAPs is directly applicable to the C measure. As a concrete example, QAP theory provides several different ways of calculating bounds on the minimum and maximum values of C for each problem. This could be very useful for the problem of assessing the quality of a map relative to the unknown best possible. One particular case is the eigenvalue bound (Finke et al., 1987). If the eigenvalues of symmetric matrices F and G are λi and µi, respectively, such that λ1 ≤ λ2 ≤ ··· ≤ λn and µ1 ≥ µ2 ≥ ··· ≥ µn, then it can be shown that \(\sum_i \lambda_i \mu_i\) gives a lower bound on the value of C. QAPs are in general known to be of complexity NP-hard. A large number of algorithms for both exact and heuristic solution have been studied (see, e.g., Burkard, 1984; the references in Finke et al., 1987; and Simić, 1991). However, particular instantiations of F and G may allow efficient algorithms for finding good local optima, or alternatively may beset C with many bad local optima. Such considerations provide an additional practical constraint on what choices of F and G are most appropriate.
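As a sketch of how this bound can be computed (our code, directly following the statement above; the function name is hypothetical):

```python
import numpy as np

def eigenvalue_lower_bound(F, G):
    """Pair the ascending eigenvalues of symmetric F with the descending
    eigenvalues of symmetric G; their inner product lower-bounds C over
    all bijections (eigenvalue bound of Finke et al., 1987, as stated above)."""
    lam = np.sort(np.linalg.eigvalsh(F))        # ascending eigenvalues of F
    mu = np.sort(np.linalg.eigvalsh(G))[::-1]   # descending eigenvalues of G
    return float(lam @ mu)
```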
5.3 Many-to-One Mappings. In many practical contexts, there are many more points in Vin than Vout, and it is necessary also to specify a many-to-one mapping from points in Vin to N exemplar points in Vin, where N = number of points in Vout. It may be desirable to do this adaptively while simultaneously optimizing the form of the map from Vin to Vout. For instance, shifting a point from one cluster to another may increase the clustering cost, but by moving the positions of the cluster centers decrease the sum of this and the continuity cost. The elastic net, for instance, trades off these two contributions explicitly with a ratio that changes during the minimization, so that finally each cluster contains only one point and the continuity cost dominates (Durbin & Willshaw, 1987; for discussions, see Simić, 1990, and Yuille, 1990). The self-organizing map of Kohonen (1982) trades off these two contributions implicitly. Luttrell (1994) discusses allowing each point in the input space to map to many in the output space in a probabilistic manner, and vice versa. The H function (Bienenstock & von der Malsburg, 1987a, 1987b) offers an alternative way of introducing a weighted match between input and output points.

Acknowledgments

We thank Steve Finch for very useful discussions at an earlier stage of this work. We are grateful to Graeme Mitchison, Steve Luttrell, Hans-Ulrich Bauer, and Laurenz Wiskott for helpful discussions and comments on the manuscript. Research was supported by the Sloan Center for Theoretical Neurobiology at the Salk Institute and the Howard Hughes Medical Institute.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science 220:671–680.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern. 43:59–69.
Lawler, E. L. (1963). The quadratic assignment problem. Management Science 9:586–599.
Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Trans. Neural Networks 1:229–232.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation 6:767–794.
Mitchison, G. (1995). A type of duality between self-organizing maps and minimal wiring. Neural Computation 7:25–35.
Mitchison, G., & Durbin, R. (1986). Optimal numberings of an N × N array. SIAM J. Alg. Disc. Meth. 7:571–581.
Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A 45:7568–7589.
Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Trans. Comput. 18:401–409.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting and clustering. Science 210:390–398.
Simić, P. D. (1990). Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Network 1:89–103.
Simić, P. D. (1991). Constrained nets for graph matching and other quadratic assignment problems. Neural Computation 3:268–281.
Torgerson, W. S. (1952). Multidimensional scaling, I: Theory and method. Psychometrika 17:401–419.
van Laarhoven, P. J. M., & Aarts, E. H. L. (1987). Simulated annealing: Theory and applications. Dordrecht: Reidel.
Wiskott, L., & Sejnowski, T. J. (1997). Objective functions for neural map formation (Tech. Rep. INC-9701). Institute for Neural Computation.
Young, F. W. (1987). Multidimensional scaling: History, theory, and applications. Hillsdale, NJ: Erlbaum.
Yuille, A. L. (1990). Generalized deformable models, statistical physics, and matching problems. Neural Computation 2:1–24.
Received February 1, 1996; accepted October 28, 1996.
Communicated by Klaus Obermayer
Faithful Representation of Separable Distributions

Juan K. Lin
David G. Grier
Department of Physics, University of Chicago, Chicago, IL 60637, U.S.A.

Jack D. Cowan
Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.
A geometric approach to data representation incorporating information-theoretic ideas is presented. The task of finding a faithful representation, where the input distribution is evenly partitioned into regions of equal mass, is addressed. For input consisting of mixtures of statistically independent sources, we treat independent component analysis (ICA) as a computational geometry problem. First, we consider the separation of sources with sharply peaked distribution functions, where the ICA problem becomes that of finding high-density directions in the input distribution. Second, we consider the more general problem for arbitrary input distributions, where ICA is transformed into the task of finding an aligned equipartition. By modifying the Kohonen self-organized feature maps, we arrive at neural networks with local interactions that optimize coding while simultaneously performing source separation. The local nature of our approach results in networks with nonlinear ICA capabilities.

1 Introduction

Barlow, Linsker, Atick, Redlich, and Field have all noted the primary importance of information theory in modeling sensory coding (Barlow, 1989; Linsker, 1989; Atick & Redlich, 1990; Field, 1994). Recently, Bell and Sejnowski (1995a, 1995b) developed an information theory-based algorithm for blind source separation. In this article, a simple synthesis of current research on coding and processing is presented by geometrically incorporating information-theoretic ideas into self-organized feature maps.

Ritter and Schulten (1986) demonstrated that, contrary to intuition, the steady state of Kohonen's self-organized feature maps (SOFM) (Kohonen, 1995) is not one in which the density of neural units is proportional to the input density. In one dimension, they showed that the Kohonen net underrepresents high-density and overrepresents low-density regions. Intuitively, an ideally formed input representation should have the density of the neural (output) units be strictly proportional to the input density. Loosely speaking, this is what is meant by a "faithful representation."
Past faithful representation SOFMs of Linsker (1989) and DeSieno (1988) have relied on a form of memory to track the cumulative activity of individual units, which the authors termed "historical information" and "conscience," respectively. Recently, Bauer, Der, and Herrmann (1996) used an adaptive step size to accomplish the task. This article continues along the same path and presents modifications to both the update rule and the partition to allow the SOFM to perform blind signal processing.

Section 2 provides preliminary definitions. Two modifications of the one-dimensional Kohonen SOFM are presented in section 3, which allow for an arbitrary power law relationship between input (data) and output (representation) densities. Section 4 generalizes the faithful representation SOFMs to higher dimensions for independent component analysis. The implications for source separation and optimal coding via the representations are made clear with examples.

2 Preliminaries

We begin with some preliminary definitions and background results. Given an input set {ξ} with ξ_j in the input vector space V, we say that {ξ} is a separable (input) distribution if it is constructed from nonsingular mixtures of statistically independent scalar sources. So {ξ} = {As}, where A is a nonsingular matrix, and the components of s are all statistically independent. Thus the joint probability factorizes: P(s) = ∏_k P_k(s_k). By a representation of {ξ}, we mean a collection {Ω(w)} of regions in V, Ω(w_i) ⊂ V. The unit w_i is said to represent ξ_j if ξ_j ∈ Ω(w_i). A nonoverlapping representation is defined to be one in which the regions are pairwise disjoint: Ω(w_i) ∩ Ω(w_i′) = ∅ for all pairs i ≠ i′ of regions in the representation. A covering is a representation where V = ∪_i Ω(w_i). So the entire input vector space V is represented in a covering, and any element in V is represented by exactly one unit in a nonoverlapping covering (partition). Focusing on nonoverlapping coverings simplifies things because probability is already normalized, ∑_i P(w_i) = 1. The assignment of probability in overlapping regions does not have to be considered. For a nonoverlapping covering {Ω(w)} of {ξ}, P(w|ξ) = 1 if w represents ξ, and 0 otherwise. The mutual information H(ξ; W) = H(W) + ∑ P(ξ, w) log P(w|ξ) = H(W). This article deals exclusively with nonoverlapping coverings.

Although equipartition is a better mathematical term, to incorporate some intuition into our terminology, we use the term faithful representation for a nonoverlapping covering in which the probability measures of all the regions are equal. Thus, faithful representations evenly partition the input distribution and maximize the mutual information between the input data and the representation. With M units in the faithful representation, the mutual information of the system is H(ξ; W) = H(W) = log M, the capacity of a noiseless channel (Shannon & Weaver, 1949). In the continuum limit, a faithful input representation's local density of units is proportional to the local input density. We now describe two local algorithms for obtaining faithful representations.
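As a quick numerical illustration (ours, not the authors'), the log M capacity of a faithful representation can be seen by building an equal-mass partition from quantiles; the exponential input density below is an arbitrary stand-in for any scalar input distribution.

```python
import numpy as np

# Sketch: an equal-mass partition of a scalar input attains the
# channel capacity log M.  The quantile-based partition here is an
# idealized stand-in for the learned one.
rng = np.random.default_rng(1)
xi = rng.exponential(size=100_000)       # any scalar input density
M = 8
edges = np.quantile(xi, np.linspace(0, 1, M + 1))
counts, _ = np.histogram(xi, bins=edges)
p = counts / counts.sum()                # region masses, all close to 1/M
H = -np.sum(p * np.log(p))               # H(W) = mutual information
print(H, np.log(M))                      # nearly equal
```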
3 Modifications of the One-Dimensional SOFM

For simplicity, we start with the one-dimensional case, V = ℝ. In a locally translationally invariant network with unit positions {w} receiving scalar input ξ, the Kohonen update rule is given by

Δw_i = κ Λ(i* − i)(ξ − w_i),    (3.1)
where the "winning" unit i* associated with input ξ is

i* = {i : |w_i − ξ| ≤ |w_j − ξ| ∀ j},    (3.2)
κ is the learning rate, and the local neighborhood function Λ(i* − i) incorporates the network topology. With care taken with respect to the midpoints between adjacent units, the Kohonen net is a dynamical nonoverlapping covering in our terminology. Here w_i gives the position of the "neural" unit and Ω(w_i) its "receptive field"—defined to be the region in V that it represents. From a set of points {w}, equation 3.2 partitions V according to the nearest-neighbor Voronoi tessellation (see, e.g., Preparata & Shamos, 1988). In this learning rule, the change in position of a unit is proportional to its distance to the data point. Units close to the winning unit are dragged along according to the predefined neighborhood function Λ(i* − i), which is taken to be a gaussian with width σ. For a more general update rule, with ε = i* − i, we scale the learning rule as follows:

Δw_i = κ Λ(ε)(ξ − w_i)|ξ − w_i|^(n−1).    (3.3)
The extra weighting term allows for a variable relation between the input data and output representation densities. Notice the n = 0 form discards the distance to the input data and relies only on a sign term. Consequently it is a purely digital update rule, where w_i can only change by discrete values. Pursuing this idea further, we propose a second modification as follows:¹

Δw_i = κ Λ(ε) · sgn(ε) · |w_i* − w_i|^n.    (3.4)
In order to extend our results to the boundaries, we also complement equation 3.4 with a movement,

Δw_i* = ∑_i Δw_i,    (3.5)

of the winning unit.
¹ The sign function sgn(i) takes on a value of 1 for i > 0, 0 for i = 0, and −1 for i < 0.
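To make the modified rules concrete, here is a minimal Python sketch (ours, not the authors' code) of a single learning step under equations 3.2 through 3.5. The gaussian neighborhood follows the simulations described in the Figure 1 caption; the annealing schedules and the periodic sorting of units are omitted.

```python
import numpy as np

def sofm_step(w, xi, kappa, sigma, n, variant="3.3"):
    """One update of the modified 1-D SOFM (equations 3.2-3.5).

    w : 1-D array of unit positions, xi : scalar input sample.
    variant "3.3" uses the input-distance weighting of equation 3.3;
    variant "3.4" uses the digital rule of equation 3.4, with the
    winner moved according to equation 3.5.
    """
    i = np.arange(len(w))
    i_star = int(np.argmin(np.abs(w - xi)))    # winner, equation 3.2
    eps = i_star - i
    lam = np.exp(-eps**2 / (2.0 * sigma**2))   # neighborhood Lambda(eps)
    if variant == "3.3":
        # For n <= 0 this weight diverges as xi -> w_i; the Figure 1
        # simulations cap the n = -1 term for stability.
        dw = kappa * lam * (xi - w) * np.abs(xi - w) ** (n - 1)
    else:
        dw = kappa * lam * np.sign(eps) * np.abs(w[i_star] - w) ** n
        dw[i_star] = dw.sum()                  # winner move, equation 3.5
    return w + dw
```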
The position of the input ξ does not directly enter in the n = 0 form of this variant. The network responds only according to its partition of the input space. Following the continuum limit analysis of Ritter and Schulten (1986), to order O(ε²), the steady-state configuration of both of the above modified SOFMs is (see appendix A)
(w(x)′)^(−1) ∝ [P(w)]^(2/(n+2)),    (3.6)
where (w(x)′)^(−1) ≡ D(w) is the density of the output units. The Kohonen algorithm is comparable to the n = 1 case, with the well-known Ritter and Schulten D(w) ∝ P(w)^(2/3) result for the steady state in which the network undersamples high-density and oversamples low-density regions. If −2 < n < 0, the network oversamples high-density regions, but the most interesting case occurs for n = 0 when D(w) ∝ P(w). In this case, the network forms a faithful representation of the input distribution.

We performed simple Monte Carlo simulations demonstrating the power law behavior for both update rules. The results are shown in Figure 1. For the n = 0 digital learning rules, the second variant (see equations 3.4 and 3.5) is not very good at untangling folds in the network because the position of the input has been discarded. Numerically, instead of algorithmically imposing a no-folding constraint, we took a shortcut by sorting and relabeling the units periodically. The second learning rule is purely a counting algorithm for achieving the equipartition steady state, so it must rely on the fact that the proper topology has already been formed. The first variant has the advantage of retaining the direction of ξ relative to the network units, necessary for the formation of the proper topology. However, this advantage comes at the cost of losing the pure counting aspect, which is essential for exact equipartition.

Both n = 0 faithful representation learning rules are digital. This is interesting from both biological and computational perspectives; with the addition of an inhibitory population, the update rule can be implemented in a network with only binary communication between its elements!

4 Faithful Representation Networks for Independent Component Analysis

4.1 Branch Net for Finding High-Density Regions. Mixtures of sources with sharply peaked distribution functions will contain high-density directions corresponding to the independent component axes. Blind separation can be performed rapidly for this case in a net with one-dimensional branch topology, the conventional Voronoi partition, and the generalization of the
Figure 1: Plot of log w(x) versus log(x) for P(w) ∝ w, showing power law behavior in a network with 100 units. The networks in the top, middle, and bottom plots were trained with n = 1, n = 0, and n = −1 learning rules, respectively. Equations 3.4 and 3.5 were used for the n = 1 and n = 0 simulations; equation 3.3 was used for the n = −1 net. The three lines plotted are for log(w) ∝ (3/5) log(x), (1/2) log(x), and (1/3) log(x), corresponding to D(w) ∝ P(w)^(2/3), P(w), and P(w)², respectively. Forty thousand data points were used in the n = 1 and n = 0 simulations; the units were sorted and relabeled every 5000 data points. For n = 1, κ = .005, σ = 10 for the first 20,000 points, and κ = .1, σ = 2 for the remaining 20,000 points. For n = 0, κ = .001; σ = 10 for the first 20,000 points, and σ = 5 for the remaining 20,000 points. For n = −1, to reduce the destabilizing effect of the singularity, the |ξ − w_i|^(−1) term was capped at 1000, and the units were sorted every 1000 data points. The learning rate κ = .00002. For the first 30,000 points, σ = 20; then it was reduced to σ = 5 for 10,000 points.
n = 0 update rule given in equation 3.3:²

Δw_i = κ Λ(ε) · sgn(ξ − w_i).    (4.1)
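A minimal sketch of this componentwise rule follows (our illustration); the branch_dist function, which measures distance to the winner along the branch structure, is assumed to be supplied by the caller.

```python
import numpy as np

def branch_step(w, xi, kappa, sigma, branch_dist):
    """One update of the branch net, equation 4.1 (a sketch).

    w : (units, dim) array of unit positions; xi : input vector.
    branch_dist(i_star, i) must return the distance between units
    i_star and i along the branch topology.
    """
    i_star = int(np.argmin(np.linalg.norm(w - xi, axis=1)))  # nearest unit
    eps = np.array([branch_dist(i_star, i) for i in range(len(w))])
    lam = np.exp(-eps**2 / (2.0 * sigma**2))  # gaussian neighborhood
    # sgn acts componentwise on the vector xi - w_i (see footnote 2)
    return w + kappa * lam[:, None] * np.sign(xi - w)
```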
Numerically, we performed source separation and coding of two mixed signals in a net with the topology of two cross-linked branches (see Figure 2). The elements of the mixing matrix A were chosen randomly with uniform distribution between 1 and −1. The input was first prewhitened following Bell and Sejnowski (1995b) to reduce the amount of dilation and shearing required of the branch net. A typical simulation is shown in Figure 2. As seen in the figure, the branch net rotates very well, and since prewhitening tends to orthogonalize the independent component axes, much of the processing that remains after whitening is rotation to find the independence axes. The net quickly zeroes in on the independent component axes with little annealing. The neighborhood function is taken to be gaussian, where ε is the distance to the winning unit along the branch structure.

To demonstrate the generality of our approach, we show a faithful representation of a nonlinear mixture. Because our network has local dynamics, with enough units, the network can follow the curved independent component contours of the input distribution. The result is shown in Figure 3. With nonlinear mixtures, however, the transformation must be one-to-one in the region of interest. To unmix the input, the independent component contours can be found by using parametric least-squares fit to the branch contours. For the source separation in Figure 2, inserting the axes directions into an unmixing matrix W, we find a reduction of the amplitudes of the mixtures to less than 1 percent of the signal.³ This is typical of the quality of separation obtained in our simulations. Alternatively, taking full advantage of the input representation formed by the net, an approximation of the independent sources can be constructed from the positions in V of the winning unit (w_i*). Or, as we show in Figure 4 for the nonlinear separation, the cell labels i* can be used to give a pseudowhitened signal approximation. Since there is only one winning unit along one branch, only one output channel is active at any given time. For sharply peaked source distributions such as speech, this does not hinder the fidelity of the signal approximation much since the input sources hover around zero most of the time. This property also has the potential for utilization in compression. However, for a full, rigorous whitened source representation, we must turn to a network with a topology that matches the dimensionality of the input data.
² Here the sign function acts componentwise on the vector.
³ For the separation shown in Figure 2, WA = [1, 0.0068; 0.0054, 1], where the largest elements of each row have been normalized to one.
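The text states only that the input was prewhitened before training; the sketch below (our addition) shows one standard way to do this, via PCA-based (ZCA) whitening to zero mean and unit covariance, and is not necessarily the exact procedure of Bell and Sejnowski (1995b).

```python
import numpy as np

def prewhiten(X):
    """Zero-mean, unit-covariance ("whitened") version of the data.

    X : (samples, dim) array of input vectors.  ZCA whitening is one
    common choice; any transform with identity output covariance
    serves the same purpose here.
    """
    Xc = X - X.mean(axis=0)
    d, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # ZCA whitening matrix
    return Xc @ W.T
```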
Figure 2: Independent component analysis by a branch architecture network. The two sources are audio files containing spoken Japanese phrases. The input distribution was prewhitened first. The branches initially lie along the ξ_1 = ξ_2 and ξ_1 = −ξ_2 directions and quickly spiral around to align with the independent component axes (dashed lines). The net is trained on 4000 randomly selected data points. For the annealing, κ = .0007; σ = 5 for the first 2000 points and σ = 2 for the remaining 2000 points. The configuration of the net is shown after every 200 data points, with the final unit positions marked with dots.
4.2 Aligned (M, N) Equipartition Net. The simplest equipartition of a separable distribution is the trivial partition generated from the independent component axes, which decouples the N-dimensional problem into N one-dimensional ones. This partition consists of N distinct hyperplane orientations, with cuts by, say, M parallel hyperplanes along each of the N orientations. We define this to be an (M, N) partition. Thus, the MN hyperplanes partition the input vector space into (M + 1)^N regions. Our goal is to achieve the trivial equipartition using an SOFM with hypercube architecture. Here M + 1 is the number of units per dimension in the hypercube. For an (M, N) equipartition, since the number of constraints grows exponentially in N, while the number of degrees of freedom to define the MN hyperplanes grows only quadratically in N, with enough units per dimension
Figure 3: Input representation of a nonlinear mixture. The mixture is given by ξ_1 = sgn(s_2) · s_2² + s_1 − .2 s_2, ξ_2 = sgn(s_1) · s_1² − .4 s_1 + s_2. The network is trained on 6000 data points, with the position of the units shown every 200 data points in the figure. The learning rate κ = .0008; σ = 5 for the first 2000 points, σ = 2 for the next 2000 points, and σ = 1 for the remaining 2000 points. The independent component contours are drawn in the dashed lines. The units along the branches conform nicely to the statistically independent contours.
in the hypercube, the desired trivial equipartition will be the only (M, N) equipartition. Currently we believe as few as three units per dimension (M = 2) suffice for uniqueness.

In order to achieve this independent component analysis (ICA) equipartition, the learning rule must be extended to many dimensions, and the definition of the partition needs to be modified. The second faithful representation learning rule is generalized to multidimensions as follows. With V = ℝ^N,

Δw_i = κ Γ(ε),    (4.2)

where Γ(ε) = −Γ(−ε). More generally, the winning unit is moved according to

Δw_i* = ∑_i Δw_i,    (4.3)

where Λ(ε) = Λ(−ε).    (4.4)

Figure 4: Input representation achieved by the branch net in Figure 3. (a, b) Intensity versus time plots of sections of the two independent input sources (Japanese phrases). (c, d) Winning unit i* versus time along the corresponding branches of the net.
Instead of extending the continuum limit analysis for distributions that factorize, we show that a finite discrete network with the dynamics given above has equipartition as a steady state. Let q_i be the probability measure
of unit i. For the steady state, we require

⟨Δw_k⟩ = 0
       = ∑_i q_i · Λ(i − k) · sgn(i − k) + q_k ∑_i Λ(k − i) · sgn(k − i)
       = ∑_i (q_i − q_k) · Λ(i − k) · sgn(i − k),

for all units k.
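As an aside, the steady-state condition can be probed numerically. The sketch below (ours; one-dimensional chain, arbitrary even kernel, hypothetical parameters) evaluates the drift for a given set of unit measures q_i and shows that it vanishes at equipartition.

```python
import numpy as np

# Numerical check that equipartition solves the steady-state equation
# above, for any even neighborhood kernel on a 1-D chain (the 2-D grid
# of appendix B is analogous).
L = 10
q = np.full(L, 1.0 / L)                 # equipartition: q_i = 1/L
kernel = lambda d: np.exp(-abs(d))      # an arbitrary even kernel
for k in range(L):
    drift = sum((q[i] - q[k]) * kernel(i - k) * np.sign(i - k)
                for i in range(L))
    assert abs(drift) < 1e-12           # <Delta w_k> vanishes
```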
By inspection, equipartition, where q_i = q_k₀ for all units i, is a solution to the equation above. In a two-dimensional rectangular grid network with a range of one for Λ(ε) as measured with the city block metric, it can be shown that equipartition is the only steady state of the dynamics (see appendix B). However, it remains to analyze more generally the conditions under which this uniqueness property holds.

Although we have shown that the network has a very desirable steady state, our desired ICA equipartition is not a proper Voronoi partition except when the independent component axes are orthogonal. To obtain the ICA aligned equipartition, it is necessary to modify the definition of the winning unit i*. Let the input data be given by {ξ} = {As}, where s is the source vector and A the nonsingular mixing matrix. Let

Ω(w_i) = {ξ ∈ ℝ^N : . . .}    (4.5)

be the winning region of the unit at w_i. Since an equipartition invariant to simultaneous linear transformations of the input and representation is desired, we require that

{AΩ(w)} = {Ω(Aw)},    (4.6)

that is, Ω is equivariant under the action of A (see, e.g., Golubitsky, Stewart, & Schaeffer, 1988). This implies that if w represents ξ, then Aw represents Aξ. The Voronoi tessellation resulting from the nearest-neighbor definition of the winning unit (see equation 3.2) satisfies this condition only for rotations and reflections, not nonsingular transformations in general. In two dimensions, we modify the tessellation by dividing up a primitive cell among its constituent units along lines joining the midpoints of the sides. For a primitive cell composed of units at a, b, c, and d, the region of the primitive cell represented by a is the simply connected polygon defined by vertices at a, (a + b)/2, (a + d)/2, and (a + b + c + d)/4. More generally, in N dimensions, the 2^N vertices of the winning polytope region are made up of \binom{N}{j}
Figure 5: Partition of the primitive cell by the four constituent units. (a) Voronoi partition. (b) Modified equivariant partition.
averages of 2^j unit positions, for all j from 0 to N.⁵ The two partitions are contrasted in Figure 5. Our modified partition satisfies equation 4.6 for all nonsingular linear transformations since they are continuous and map polygons to polygons. Note, however, that this partition is valid only for networks with hypercube architecture and defined in the interior of the hypercube. The region outside the hypercube is partitioned by careful extrapolation of the boundary unit partitions.

The learning rule given above was shown to have an equipartition steady state. It remains, however, to align the partitions so that it becomes a valid (M, N) partition. The addition of a local anisotropic coupling that physically, in analogy to elastic nets, might correspond to a bending modulus along the network's axes will tend to align the partitions and enhance convergence to the desired steady state. The algorithmic details are given in appendix C.

Numerically, we trained a two-dimensional 9 × 9 network to represent linear mixtures of two sources faithfully. The result is shown in Figure 6. Using least-squares linear fit to find the independent component axes, the mixture amplitudes are found to be approximately 1 percent of the signal.⁶ Geometrically, the units arrange themselves so that each unit has an equal probability of winning, with the axes of the net aligned with the independent component axes. From a coding perspective, an optimal quantization grid is formed by the network. With the ICA equipartition in place, the components of the winning unit i* give whitened approximations of the independent sources.
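As an illustration (ours, not from the original), the two-dimensional cell construction just described, and its equivariance per equation 4.6, can be written out directly; the mixing matrix reused from footnote 6 is incidental.

```python
import numpy as np

def cell_region(a, b, c, d):
    """Vertices of the polygon representing unit a inside the primitive
    cell (a, b, c, d), per the modified partition described above.

    a..d : 2-D unit positions, with b and d adjacent to a and c
    opposite.  Equivariance under any nonsingular linear map follows
    because each vertex is a fixed linear combination of the units.
    """
    a, b, c, d = map(np.asarray, (a, b, c, d))
    return np.array([a, (a + b) / 2, (a + b + c + d) / 4, (a + d) / 2])

# Equivariance check: mapping the cell by A maps the region by A.
A = np.array([[0.06, 0.76], [0.43, 0.13]])   # mixing matrix of footnote 6
cell = [np.array(p, float) for p in ([0, 0], [1, 0], [1, 1], [0, 1])]
mapped_first = cell_region(*[A @ p for p in cell])
first_mapped = np.array([A @ v for v in cell_region(*cell)])
assert np.allclose(mapped_first, first_mapped)
```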
⁵ Recall 2^N = ∑_{j=0}^{N} \binom{N}{j}.

⁶ The mixing matrix used in Figure 6 is A = [0.06, 0.76; 0.43, 0.13]. Using least-squares linear fit for W, we find WA = [1, 0.0041; −0.016, 1].
Figure 6: Plot in the ξ coordinate frame of the input distribution, independent component axes (dashed lines), and representation formed by the network. The units of the network are located at the grid points of the wire mesh. Initially the units were equally spaced and aligned with the ξ̂_1 and ξ̂_2 directions. A gaussian with σ = 8 was used for the neighborhood function Λ(ε), and the learning rate was κ = .00015. The network is shown after presentation of 4000 randomly chosen points in the input data set. Faithful sampling and alignment of the network along the independent component axes can be seen.
For compression, given a network with α units, if the elements in {ξ} can be in any one of β states, the compression factor achieved by the net's input representation is log α / log β. This is just the log ratio of the input and representation state-space sizes. If further data compression is desired, the activity of units closer than a threshold need not be differentiated.⁷ Often the threshold is set at a characteristic noise or sensitivity level. This practice, standard in the field of wavelet decomposition, can lead to highly compact information coding, especially for sharply peaked input source distributions (see, e.g., Daubechies, 1992; Koornwinder, 1993).
⁷ Distance as measured in the input vector space V.
5 Discussion

There are a few important differences between the (M, N) equipartition network and past SOFMs. First, the Voronoi partition has been replaced with an equivariant partition specially designed for separable input distributions. Second, in contrast to previous SOFMs, direct dependence on ξ and w_i has been extracted out. Because the equipartition learning rule (see equations 4.3 and 4.4) is purely in terms of the network's internal architecture Λ(ε), it learns from its own coarse response to stimuli. This is an important aspect of our algorithm, though much work remains to be done on the analysis of these digital learning rules. Also, the question of stability of the steady state, which involves both the digital learning rules and the analog alignment algorithms, needs to be addressed.

The folding problem alluded to earlier is closely tied to this digital counting aspect of our algorithm. For the square blind separation problem where the number of sources is equal to the number of sensors, by starting with a network with no folds and the right topology, the problem can be remedied by the addition of an algorithm that prevents folding. However, a potentially more serious problem arises from the indirectness of defining a partition from the SOFM units. Folds in a network are actually required to define certain partitions by the SOFM units. In practice, this should not pose a serious problem since only a few units per dimension are required for ICA.

For separable input distributions, the (M, N) faithful representation network performs a nonparametric histogram density estimation while separating out statistically independent sources. In contrast to the nonlocal parametric blind separation work of Bell and Sejnowski (1995a, 1995b), this source separation approach is very general in the sense that no a priori assumptions about the individual source probability density distributions are made. They need not be symmetric, sharply peaked, unimodal, or even continuous. Furthermore, because of the local nature of the network, aside from being more robust, nonlinear mixtures of the sources can also be properly represented.

From a coding perspective, intuitively, the most efficient encoding must require an understanding of the input. Arguably, conventional wavelet decompositions encode without processing, since a fixed scaling function φ and a derivative mother wavelet ψ are used to encode local mean and detail information, respectively (see, e.g., Daubechies, 1992; Koornwinder, 1993). In neural network terminology, wavelet decomposition is strictly feedforward, while the lateral coupling in our network allows for higher-order processing. Because of the uniqueness of the (M, N) equipartition for separable distributions, faithful representation coding is coupled with blind source separation. In our approach, local coding and processing combine to generate globally optimal coding and processing. In a sense, the units in our network are local independence and activity detectors.
6 Conclusion

The organization principle of maximizing the entropy of a system's response to stimuli ensures that on average there are no overactive or idle units in the network. This demand-based resource allocation principle underlies many of the organizational decisions we make but also provides an organizational basis for coding and processing on a microscopic level. By mixing in elements from statistics, information theory, and self-organized feature maps, we find networks with local dynamics that possess quite general coding and processing capabilities. This approach unifies processing and coding in the sense that the encoding formed is optimized to process previously encountered stimuli. In a network in which the representations are dynamically formed, no distinction is necessary between an early developmental and a late processing stage.

Appendix A: One-Dimensional Continuum Limit Analysis

For the update rule given in equations 3.4 and 3.5, we set the ensemble average ⟨Δw⟩ to zero and go to the continuum limit with the net parameterized by the continuous variable x. Let P(ξ) be the probability density function of the input, which we take to be differentiable. Taylor expanding about ε = x* − x, and keeping terms of order O(ε²), we need to solve

⟨Δw⟩ = 0
     = ∫ Λ(ε) ε |w(x*) − w(x)|^n P(ξ) dξ
     ≈ ∫ Λ(ε) ε |ε w′|^n (1 + n ε w″ / (2 w′)) (P + ε w′ D_w P)(w′ + ε w″) dε,

where

w′ = dw(x)/dx |_x,   and   D_w P = dP(w(x))/dw |_{w(x)}.

Taking Λ(ε) to be an even function, all odd powers of ε vanish. The differential equation that remains is

(D_w P / P) w′ = −((n + 2)/2) (w″/w′).    (A.1)

Its solution is given by

(w(x)′)^(−1) ∝ [P(w)]^(2/(n+2)).    (A.2)
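The power law can be sanity-checked numerically (our addition): for P(w) ∝ w, equation A.2 gives w(x) ∝ x^((n+2)/(n+4)), which is the x^(3/5) law of Figure 1 when n = 1. The finite-difference sketch below verifies that this profile satisfies equation A.1.

```python
import numpy as np

# Finite-difference check of equation A.1 for P(w) proportional to w,
# whose solution by equation A.2 is w(x) = x**((n+2)/(n+4)).
for n in (1, 0, -1):
    x = np.linspace(0.2, 0.8, 2001)
    h = x[1] - x[0]
    w = x ** ((n + 2) / (n + 4))
    w1 = np.gradient(w, h)                  # w'
    w2 = np.gradient(w1, h)                 # w''
    lhs = (1.0 / w) * w1                    # D_w P / P = 1/w here
    rhs = -((n + 2) / 2.0) * w2 / w1
    assert np.allclose(lhs[5:-5], rhs[5:-5], rtol=1e-3)
```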
The calculation for the other learning rule is almost identical and leads to the same result.

Appendix B: Uniqueness of Equipartition Steady State

Here a sketch of the proof is provided for the uniqueness of the equipartition steady state in a two-dimensional rectangular grid network. The range of Λ(ε) as measured with the city block metric is one. The labels for the probability measure of the top two rows of the network are taken to be q_(1,1), q_(2,1), . . . , q_(L,1), and q_(1,2), q_(2,2), . . . , q_(L,2). The steady-state equation for the corner unit at (1, 1) imposes the constraints q_(1,2) − q_(1,1) = q_(2,1) − q_(1,1) and q_(1,1) − q_(2,2) = q_(2,1) − q_(1,1). Continuing down the row of border units to the other corner unit at (L, 1), the steady-state equations constrain the probability measures of all the units in these two rows to be equal. The equations for the second row units extend the equipartition to the third row. This proceeds until the probability measures of all the units are shown to be equal. Thus, the equipartition steady state is the only steady state of the dynamics.

Appendix C: Alignment Algorithms

Although the exact form of the alignment algorithms is not essential, we describe here some of the simple local algorithms used in our simulations. First, after each update, the units in the network were moved toward a boxcar convolution of the unit positions. The kernel used was (1/3, 1/3, 1/3) in one dimension and (1/3, 1/3, 1/3)ᵀ in the other dimension. The units were moved a fraction of the distance toward the smoothed positions. For Figure 6, the fraction was set to 0.02. This tends to straighten out the individual rods in the network.

Second, we implemented an algorithm that attempts to make each primitive cell a parallelogram. Unit positions were adjusted to try to equalize the opposing lengths in each primitive cell as follows. In a primitive cell composed of units at a, b, c, and d,

a → a + (η/2) [(a − d)(−1 + l2/l4) + (a − b)(−1 + l3/l1)],

where η is a gain term and l1, l2, l3, l4 are the lengths |a − b|, |b − c|, |c − d|, and |d − a|. The other three units are also updated accordingly. For Figure 6, η was set to 0.05. This algorithm has the effect of making the individual rods parallel.

Acknowledgments

We thank Trevor Mundel, Alexander Dimitrov, and the anonymous referees for many helpful comments.
References

Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural Comp., 4, 196–210.
Barlow, H. B. (1989). Unsupervised learning. Neural Comp., 1, 295–311.
Bauer, H.-U., Der, R., & Herrmann, M. (1996). Controlling the magnification factor of self-organizing feature maps. Neural Comp., 8, 757–771.
Bell, A. J., & Sejnowski, T. J. (1995a). An information-maximization approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1995b). Fast blind separation based on information theory. In Proc. 1995 International Symposium on Nonlinear Theory and Applications (Vol. 1, pp. 43–47). Tokyo: NTA Research Society of IEICE.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM.
DeSieno, D. (1988). Adding a conscience to competitive learning. IEEE Int. Conf. on Neural Computing, 1, I117–I124.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comp., 6, 559–601.
Golubitsky, M., Stewart, I., & Schaeffer, D. G. (1988). Singularities and groups in bifurcation theory. Berlin: Springer-Verlag.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Koornwinder, T. (Ed.). (1993). Wavelets: An elementary treatment of theory and applications. Singapore: World Scientific.
Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comp., 1, 402–411.
Preparata, F. P., & Shamos, M. I. (1988). Computational geometry: An introduction. New York: Springer-Verlag.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern., 54, 99–106.
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Received May 8, 1996; accepted November 22, 1996.
Communicated by Klaus Obermayer and Geoffrey Goodhill
Self-Organized Formation of Various Invariant-Feature Filters in the Adaptive-Subspace SOM

Teuvo Kohonen
Samuel Kaski
Harri Lappalainen
Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland
The adaptive-subspace self-organizing map (ASSOM) is a modular neural network architecture, the modules of which learn to identify input patterns subject to some simple transformations. The learning process is unsupervised, competitive, and related to that of the traditional SOM (self-organizing map). Each neural module becomes adaptively specific to some restricted class of transformations, and modules close to each other in the network become tuned to similar features in an orderly fashion. If different transformations exist in the input signals, different subsets of ASSOM units become tuned to these transformation classes.

1 Introduction

A long-standing problem in artificial perception has been to recognize patterns subject to simple transformations, such as translation, rotation, or scaling. A frequently held view is that invariances can be achieved by filtering of the pattern features that then become mapped into respective invariance classes. Another problem, equally important and intriguing, then ensues: how to determine a set of effective feature filter functions for arbitrary statistics of dynamic signals. These problems seem to have been solved in biological systems in some effective way. In this article, we devise a corresponding technical solution.

In traditional image analysis and pattern recognition techniques, the primary observations are usually preprocessed in a separate stage that extracts a number of invariant features. On the basis of these features, a statistical decision-making machine is then able to identify or classify the target invariantly. In applications where no prior standardization of the samples is possible—for instance, when the targets are moving or their illumination is varying—the wavelet transforms such as the Gabor transforms (Gabor, 1946; Daubechies, 1990) have become popular as extractors of approximately invariant local features. These transforms, although optimizable by standard techniques, have the limitation that their mathematical forms must be fixed a priori. The learning scheme discussed in this article differs from all the
previous approaches in that the forms of the filter functions are learned directly from short segments or episodes of dynamical signals or pattern samples. If the samples of the episode can be derived from each other by some linear transformation such as translation, they span a special manifold in the input signal space, namely, a linear subspace.

The ASSOM (adaptive-subspace self-organizing map) architecture discussed in this article was introduced by one of the authors in 1995 (Kohonen 1995a, 1995b, 1995c, 1996); this article is a continuing study of it. In this scheme, the various feature filters emerge in learning and become tuned to those manifolds of input patterns that are defined only transiently in short segments of signals but that occur sufficiently often. These filters thus learn not the pattern prototypes as such, but the different low-dimensional manifolds spanned by the consecutive signal pattern vectors.

The discussion here constitutes a viable alternative to the standard principal component analysis (PCA) methods of feature extraction (Hotelling, 1933) for which neural approaches have been presented (Oja, 1983, 1992; Rubner & Tavan, 1989; Cichocki & Unbehauen, 1993). The PCA extracts average features that reflect the global stationary statistical properties of the pattern sequence. The ASSOM, on the contrary, creates a set of statistical representations referring to different temporal events, and by competitive (winner-take-all) selection and learning, each of the representations selectively concentrates on a different class of temporal episodes. Moreover, due to the spatial interactions between the processing units of the ASSOM during learning, characteristic of all the SOM (self-organizing map) methods (Kohonen, 1982, 1995c) in general, the various feature filters emerging in this process will be organized spatially, processing units being ordered along some characteristic feature coordinates.

We should mention a couple of recent works (Földiák, 1991; Wallis, Rolls, & Földiák, 1993) in which neural units learn sequences of simple patterns. In these works, a fixed input layer of a two-layer architecture projects to a competitive second layer, where the winning unit learns a trace of successive inputs. The results obtained with the model do not, however, explain the emergence of the input-layer filters, which are essential for learning arbitrary manifolds. Effective and organized nontrivial invariant-feature filters can ensue only from the cooperation of several kinds of neural components, such as those constituting the ASSOM network architecture discussed here.
2 Auxiliary Concepts

First we set out to discuss the following subproblems in this section: invariant recognition of a pattern as invariant recognition of all its feature components, definition of invariance classes for feature components as linear subspaces, and adaptive learning of the invariance classes from short sequences of input patterns called the episodes.
2.1 Decomposition of a Pattern into Features Mapped into Invariance Classes. Vectors that represent dynamic sensory environments are usually distributed along manifolds, which are continuous subsets or zones in the vector space. Consider tentatively that the input pattern represents some fundamental feature component of a more complex object and is thus very simple, such as the Gabor functions are. Its variations, caused by a particular class of transformations of the basic pattern such as translation, can then be thought to span a low-dimensional manifold. For the simplest transformations, the manifold can even be represented by a linear subspace, in which every vector is linearly dependent on the others. The class of linear transformations of the same fundamental component will be mapped into the same linear subspace, which is then regarded as an invariance class. If a more complex pattern can then be decomposed into a sum of such simple feature components, approximately at least, then all terms in the sum become mapped into their respective linear subspaces invariantly, whereby the whole pattern will be represented invariantly with respect to this same linear transformation.

2.2 Sample Vectors. The input to a neural network model may be described by a real vector x ∈ ℝⁿ. The vector x = x(t) may consist of samples of a time-domain signal with intensity I, taken at points displaced from a time-dependent reference point t by the lags v_i ∈ ℝ:

x(t) = [I(t − v_1), I(t − v_2), . . . , I(t − v_n)]ᵀ.    (2.1)
Alternatively, the input vector x may be composed of samples of a two-dimensional input image, collected from points with intensity I around a time-dependent reference point r = r(t) ∈ ℝ². These sampling points are displaced relative to r by the amounts s_i ∈ ℝ²:

x(t) = x[r(t)] = [I(r(t) − s_1), I(r(t) − s_2), . . . , I(r(t) − s_n)]ᵀ.    (2.2)
The v_i and s_i, respectively, form a regular or irregular sampling grid. The time series and the images, as well as patterns eventually representable in higher-dimensional domains, will be discussed below in the framework of the general vector space formalism.

2.3 Episodes, Signal Subspaces, and Orthogonal Projections. One of the basic problems is how the linear subspaces, the invariance classes, can be defined; another is how to test whether an unknown pattern belongs to a subspace. Consider, for instance, time-domain signals. If N sample vectors (N < n) are collected from successive reference points t_p, the set of the nearby sample vectors {x(t_p)} spans some linear signal subspace of the total input space. The set of the reference time instants S = {t_p} will be called an episode. The maximum dimensionality of this subspace is equal to the number of
sampling instants in S, but the dimensionality can be much smaller if the sample vectors are dependent, as is often the case. The connection between subspaces and invariances lies in realizing that the signal subspace itself is an invariant feature of the input vector x.

A linear subspace L of dimensionality H is in general completely defined by the general linear combination of the linearly independent basis vectors b_1, . . . , b_H. However, the choice of the b_h is not unique; there exist infinitely many equivalent combinations of the b_h for the same L, in some of which the b_h can be orthonormal. If a set of basis vectors is known, a set of equivalent orthonormal basis vectors for L can be computed by the familiar Gram-Schmidt process (for a textbook account, cf. Kohonen 1989, 1995c).

The basic operation to determine whether an unknown vector x belongs to L is the orthogonal projection of x on L, denoted x̂. If x̂ = x, then x belongs to L. For an arbitrary x, one can define its distance from L as ‖x̃‖ = ‖x − x̂‖, where the norm is Euclidean. The orthogonal projection x̂ can be computed as either of the following simple expressions, provided that the b_h are orthonormal:

x̂ = (∑_{h=1}^{H} b_h b_hᵀ) x = ∑_{h=1}^{H} (b_hᵀ x) b_h.    (2.3)
The latter expression is computationally lighter. In the case that x does not belong to any subspace exactly, it can still be identified with the closest one. The factor in the first parenthesis of equation 2.3 is a projection operator matrix abbreviated P, and there holds P² = P, Pᵀ = P.

The pattern space is divided into pattern zones, each of which contains one of the above subspaces. The optimal separating surface between the ith and jth zone is the set of those points x, the projections of which on the respective subspaces L^(i) and L^(j) have equal norms. The separating surface is thus defined by the following quadratic equation in x:

xᵀ (P^(i))² x − xᵀ (P^(j))² x = xᵀ (P^(i) − P^(j)) x = 0,    (2.4)

where P^(i) and P^(j) are the projection operators to L^(i) and L^(j), respectively.
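To make the projection machinery concrete, the following sketch (ours; the random bases are stand-ins for learned ones) computes x̂ per equation 2.3 and classifies x by the sign of the quadratic form of equation 2.4.

```python
import numpy as np

def project(B, x):
    """Orthogonal projection of x on the subspace spanned by the
    orthonormal columns of B: x_hat = sum_h (b_h^T x) b_h (eq. 2.3)."""
    return B @ (B.T @ x)

rng = np.random.default_rng(2)
Bi, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # orthonormal basis of L(i)
Bj, _ = np.linalg.qr(rng.normal(size=(6, 2)))  # orthonormal basis of L(j)
x = rng.normal(size=6)

Pi, Pj = Bi @ Bi.T, Bj @ Bj.T                  # projection operators
assert np.allclose(Pi @ Pi, Pi)                # P^2 = P
assert np.allclose(project(Bi, x), Pi @ x)

side = x @ (Pi - Pj) @ x                       # quadratic form of eq. 2.4
print("x is closer to L(i)" if side > 0 else "x is closer to L(j)")
```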
2.4 The Network Model. It may be helpful to imagine that the set of subspaces, the invariance classes, corresponds to a two-layered neural architecture, organized into modules that function as processing units (see Figure 1). Each of the modules represents a subspace denoted by L^(i). The neurons in the input layer are linear; if the weight vector of neuron h in the input layer of module i is denoted by b_h^(i) and the input by x, respectively, the output of this neuron is xᵀ b_h^(i). The weight vectors b_h^(i) form the basis of the subspace L^(i), and they will be referred to as basis vectors. The neurons in the output layer have some quadratic transfer function Q. If in practical
Figure 1: Neural network that consists of linear-subspace modules. The modules are separated by dashed lines. Q: quadratic neuron.
simulations the vectors b_h^(i) in each module are kept mutually orthonormal, Q is simply the sum of squares of the inputs of the second-layer neuron. The sum of squares formed by the output neuron in each module equals the squared norm of the projection x̂^(i) of the input on the subspace L^(i), or ‖x̂^(i)‖². The output of the module can then be interpreted as the degree of matching between the input and the subspace. Thus, the output of each module is invariant (neglecting eventual changes in the norm of the input) to restricted linear transformations that occur within the subspace represented by the module. In the competitive learning process discussed below, the modules learn to represent different invariance classes.

Competitive learning in this network also involves neighborhood interactions between the modules similar to the neighborhood function in the standard SOM algorithm (Kohonen 1982, 1995c). Due to these interactions, neighboring modules learn to represent similar features, or similarly transformed features, and the array of the modules becomes tuned to the features in an orderly fashion. In the larger scale, different areas in the network may become specialized to different input types, such as transforms, as a result of these interactions.

3 Theory of the ASSOM

In this article, we not only want to create sets of feature filters; we also want these filters to become self-organized into a spatial order that reflects the similarity of the feature values they detect. Such a "map" is easier to inspect, and the artificially produced "maps" can then be better compared with those existing in the neural realms.

3.1 Learning. The ASSOM units learn to become invariant, approximately at least, to the transformations that exist in their environment by modifying their own subspaces to improve matching with the input signal subspaces. The latter are spanned by the successive samples in the episodes S, as discussed in section 2.3.
In order to approximate the input episodes by subspaces of low dimensionality, in the experiments described here, we fixed the ASSOM modules to have two input neurons, the minimum number required to extract invariant features (cf. Gabor filters). It is also quite possible to use more input neurons. When these modules are made to compete on the input signal subspaces, the module that represents a given signal subspace best wins the competition, and it, together with its neighbors in the array, is adapted to represent the input subspace even better. In this way the different modules become gradually specialized to represent different types of input subspaces. As the modules in the neighborhood of the winner are also adapted to represent the input better, the neighboring modules gradually learn to represent similar inputs; this effect can be regarded as a kind of smoothing of the neighboring representations. Such "neighborhood learning" was also the basis of the SOM algorithm, where the weight vectors of processing units become spatially ordered in the map array. The ASSOM learning process can be summarized in general terms as follows:

1. Locate the module ("winner") c, the subspace L^(c) of which is closest to the signal subspace of the input episode S.

2. Adjust the subspace L^(c) of the winning module and modules in its neighborhood in the ASSOM array closer to the signal subspace.

Direct comparison of the signal subspace with the subspaces of the neural modules could be rather difficult, for instance, because the dimensionality of the input signal subspace may be defined vaguely due to noise in the samples. A simpler and much more robust definition of match between the input subspace and the neural subspaces is comparison of each input sample of the episode separately with the neural subspaces, and computation of the "energy" (sum of squares) of the projections of the input samples of the episode on each of the subspaces of the ASSOM modules. The module with the largest energy—the sum of the squared projections of the samples collected during an episode—wins the competition. Denote the projections of the samples collected during S by x̂^(i)(t_p). The representative winner for the whole episode S, again indexed by c, is then defined by the expression

c = arg max_i ∑_{t_p ∈ S} ‖x̂^(i)(t_p)‖²,

or equivalently,

c = arg min_i ∑_{t_p ∈ S} ‖x̃^(i)(t_p)‖².    (3.1)
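A direct transcription of this winner rule into code is straightforward; the sketch below (ours) assumes each module's basis vectors are stored as orthonormal columns, so the projection energies reduce to sums of squared inner products.

```python
import numpy as np

def episode_winner(bases, X):
    """Winner c of equation 3.1 (a sketch).

    bases : list of (n, H) arrays with orthonormal columns, one per
    ASSOM module; X : (N, n) array of episode samples x(t_p).
    The module collecting the largest projection energy over the
    episode wins the competition.
    """
    energies = [np.sum((X @ B) ** 2) for B in bases]
    return int(np.argmax(energies))
```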
The final goal in the competitive learning of subspaces is to minimize the
average expected error when a stochastically selected input subspace is represented by the subspace of the winning module. A somewhat related error principle was also used in the classical vector quantization (Gersho, 1979; Gray, 1984; Makhoul, Roucos, & Gish, 1985). The distance between the subspace spanned by the input samples and the subspace L^(i) will here be measured in terms of the energy of the relative projection errors of the input samples.

Consider tentatively that no neighborhood interactions between the modules are yet taken into account and that the input vectors are normalized. The average expected distance between the input subspace and the subspace of the winning module, computed over the space of all possible Cartesian products X of input samples collected during the episodes S, can be measured in terms of the projection error x̃^(c) and is expressed by the objective function

E = ∫ ∑_{t_p ∈ S} ‖x̃^(c)(t_p)‖² p(X) dX.    (3.2)
Here p(X) is the probability density of the X, and dX is a shorthand notation for a volume differential in the integration space, the Cartesian product of all the samples of the episode. The index of the winning module c depends on the whole episode, as well as on the subspaces L^(i) of all modules i. Thus the exact minimization of equation 3.2 by differentiation may be very complicated (Kohonen, 1991, 1995c; for recent alternative treatments, see Buhmann & Kühnel, 1993; Luttrell, 1994). Therefore we shall resort to the classical Robbins-Monro stochastic approximation (Robbins & Monro, 1951) in minimizing equation 3.2 in the same way as in the derivation of the traditional SOM algorithms (Kohonen, 1991, 1995c). The objective function need not be a Lyapunov or energy function; nonetheless, the convergence of the stochastic approximation has been established generally (Albert & Gardner, 1967; Kushner & Clark, 1978; Wasan, 1969).

The objective function (see equation 3.2) describes the goal of pure unordered competitive learning. In Kohonen (1991, 1995c), to enforce ordering in the old vector quantization problems, the integrand in the error function E was smoothed locally with respect to the processing modules, which leads to the SOM algorithms. The same modification can also be made here by multiplying the integrand with the neighborhood kernel h_c^(i), which is a decreasing function of the distance between the modules c and i in the ASSOM array. The new objective function then reads

E₁ = ∫ ∑_i ∑_{t_p ∈ S} h_c^(i) ‖x̃^(i)(t_p)‖² p(X) dX.    (3.3)
Although attempts to optimize cost functions of the type in equation 3.3
have not succeeded in closed form, it has been possible to use the Robbins-Monro stochastic approximation to derive the SOM algorithm (Kohonen, 1991, which also compared the exact and approximate optima). In this approximation, the gradient of E₁ is evaluated on the basis of the last input x and the last values of b_h^(i) during the sampling sequence. Here the whole set X = X(t) is regarded as constituting a "sample" in the stochastic approximation. Thus, instead of optimizing the original objective function E₁, in this approximation a step in the direction of the negative gradient of the sample function, E₂(t), is taken:

E₂(t) = ∑_i h_c^(i) ∑_{t_p ∈ S(t)} ‖x̃^(i)(t_p)‖².    (3.4)
Let us recall that x̃^(i) = x − x̂^(i), where x̂^(i), as a function of the b_h^(i), is given by equation 2.3. If the basis vectors are kept orthonormal, the gradient of the sample function with respect to basis vector b_h^(i) is

∂E₂/∂b_h^(i) = −2 h_c^(i) [∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ] b_h^(i).    (3.5)
Comment 1. As index c would be changed abruptly when the signal subspace passes the border between its two closest subspaces L^(i), it will be necessary to stipulate that the signal subspace does not belong to the infinitesimal neighborhood of such a border in equation 3.5. For continuous, stochastic signals, this possibility can be neglected. (See a similar discussion in Kohonen, 1991, 1995c.)

Comment 2. Because the basis vectors that span a subspace are not unique, it might seem that the gradient-descent method cannot be used because the optimum is not unique. Notwithstanding, it may be clear that any of these optima (for which E₂ is minimized) is an equivalent solution, so it will suffice if one optimal basis is found.

With a step length of λ(t)/2 in the direction of the negative gradient, we obtain the rule

b_h^(i)(t + 1) = b_h^(i)(t) − (1/2) λ(t) ∂E₂/∂b_h^(i)
              = [I + λ(t) h_c^(i) ∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ] b_h^(i)(t).    (3.6)
The parameter λ(t) is called the learning-rate factor. In stochastic approximation, the λ(t) must satisfy the following conditions (Robbins & Monro, 1951; Kushner & Clark, 1978): ∑_{t=0}^{∞} λ(t) = ∞, ∑_{t=0}^{∞} λ²(t) < ∞.
In the learning law (see equation 3.6), the operator

R₁(λ(t)) = I + λ(t) h_c^(i) ∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ,    (3.7)
is applied to the basis vector b_h^(i). In the derivation of the learning law, the input vectors were tentatively assumed to be normalized. If they are not, the normalization can be incorporated into the objective function (see equation 3.3) by considering relative projection errors ‖x̃^(i)(t_p)‖²/‖x(t_p)‖². Then the rotation operator (see equation 3.7) becomes

R₂(λ(t)) = I + λ(t) h_c^(i) ∑_{t_p ∈ S(t)} x(t_p) x(t_p)ᵀ / ‖x(t_p)‖².    (3.8)
Earlier (Kohonen, 1995c, 1996) we showed that the following slightly different law, a product of unnormalized rotation operators, can be used for the competing neural units to learn the different input subspaces:

R₃(λ(t)) = ∏_{t_p ∈ S(t)} [I + λ(t) h_c^(i) x(t_p) x(t_p)ᵀ / ‖x(t_p)‖²].    (3.9)
When λ(t) is small, equations 3.8 and 3.9 can immediately be found equivalent. In fact, equation 3.8 can be seen as a batch version of equation 3.9, where the whole episode is used in one operation. One of the authors earlier proved the so-called adaptive-subspace theorem (Kohonen, 1996), which states that if there is only one subspace L^(i), it will converge to L* ⊆ L^(s) almost surely, where L^(s) is the signal subspace from which the samples x(t_p) are picked at random. We shall henceforth use the operator R₃(λ(t)), because it applies the input samples x(t_p) in succession, and some extra operations explained in section 3.2 shall be made after each operation. The operator R₃(λ(t)) in fact corresponds to Hebb-type learning: the modification of a synapse (a component of a basis vector) is proportional to the product of its input and the output of the neuron. Since expressions 3.8 and 3.9 do not preserve the norms of the basis vectors, for good performance the basis vectors should be orthonormalized, at least after some number (e.g., a few dozen) of learning steps.
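As a rough sketch of the resulting learning step (our illustration, with the neighborhood kernel left abstract), the operator of equation 3.9 can be applied sample by sample and followed by re-orthonormalization; the error-weighted variant introduced in section 3.2 would change only the scalar factor.

```python
import numpy as np

def assom_learn_episode(bases, X, c, lam, h):
    """Apply the operator R3 of equation 3.9 to every module for one
    episode, then re-orthonormalize the bases.

    bases : list of (n, H) matrices of basis vectors (columns);
    X     : (N, n) array of episode samples x(t_p), one per row;
    c     : index of the winning module; lam : learning rate;
    h     : callable h(i, c) giving the neighborhood kernel value.
    """
    n = X.shape[1]
    for i, B in enumerate(bases):
        R = np.eye(n)
        for x in X:                   # input samples applied in succession
            R = (np.eye(n)
                 + lam * h(i, c) * np.outer(x, x) / np.dot(x, x)) @ R
        # QR factorization plays the role of the Gram-Schmidt process.
        bases[i], _ = np.linalg.qr(R @ B)
    return bases
```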
not the case, the corrections made in neighboring units would depend strongly on their basis vectors before correction, which would lead to unstable self-organization. The magnitude of the correction made on the subspace $L^{(i)}$ should be a monotonically increasing function of the true error $\|\tilde{x}^{(i)}\| = \|x - \hat{x}^{(i)}\|$ or a monotonically decreasing function of $\|\hat{x}^{(i)}\|$. In the projective methods, the correction (see equation 3.6) would be proportional to $x(t_p)^T b_h^{(i)}$, which decreases with increasing angle between $x(t_p)$ and $b_h^{(i)}$. A simple method for guaranteeing that the correction is an increasing function of the error (cf. Kohonen, 1996) is to divide the learning-rate factor $\lambda(t)$ by the scalar value $\|\hat{x}^{(i)}\| / \|x\|$, which changes only the effective learning rate. The operator (see equation 3.9) then becomes

$$R^{(i)}(t) = \prod_{t_p \in S(t)} \left[ I + \lambda(t)\, h_c^{(i)}\, \frac{x(t_p)\, x(t_p)^T}{\|\hat{x}^{(i)}(t_p)\| \, \|x(t_p)\|} \right]. \tag{3.10}$$
In the course of our simulations, reported in section 4.2, it turned out that the organization of the ASSOM was not always perfect even when the modified rotation operator (see equation 3.10) was used. For instance, when the input consisted of speech signals, some modules were prone to become tuned to two separate frequency bands. An explanation of this phenomenon is that a high-order linear filter can in general easily learn a rather complicated passband function, even one with many passbands; it is mainly by virtue of the competitive-learning process that single passbands are formed at the filters at all, because the process tries to distribute them in an orderly fashion over the frequency domain. When such two-band filters were formed in self-organization, the frequency range between the two bands remained unrepresented by the ASSOM array; that is, a gap in the distribution was formed (see section 4.2).

Without further speculation it may be mentioned that this instability can be eliminated by the following simple method, which has been empirically demonstrated to produce filters that are much better distributed in the frequency domain. If, after each learning step, the magnitude of small components of the basis vectors $b_h^{(i)}$ is set to zero, then the basis vectors will be forced to approximate the stronger and more fundamental frequency components. If we denote the basis vectors by $b_h^{(i)} = [b_{h1}^{(i)}, \ldots, b_{hn}^{(i)}]^T \in \mathbb{R}^n$, then a very simple dissipation effect of this kind can be described by the equation

$$b_{hj}^{\prime(i)} = \mathrm{sgn}(b_{hj}^{(i)})\, \max(0, |b_{hj}^{(i)}| - \varepsilon), \tag{3.11}$$

where $\varepsilon$ ($0 < \varepsilon < 1$) is a very small term. The amount of dissipation should be a fraction of the magnitude of the correction: $\varepsilon = \varepsilon_h^{(i)}(t) = \delta\, |b_h^{(i)}(t) - b_h^{(i)}(t-1)|$,
where $\delta$ is a small scalar parameter. The dissipation is applied after each rotation and before normalization.

Other regularizing principles have also increased the stability of self-organization in our experiments. For instance, there exist excessive "parasite" degrees of freedom in the representation of basis vectors because of various symmetries. These parasite degrees of freedom often result in asymmetric filters; for instance, the filters may become centered at any point in the sampling grid. To prevent the ASSOM from becoming organized along irrelevant feature dimensions, there is a simple method for favoring the middle of the sampling grid without significantly affecting the learning process in other respects: weighting the lattice points by a gaussian function centered in the middle. If the function is wide enough, it affects only the components of the basis vectors quite near the edges. Since all input vectors are weighted similarly, this weighting does not alter the transformation group inherent in the input vectors but enhances symmetry.

To speed up the organization process, the gaussian weighting function can also be made time dependent. In the beginning of the organizing process, the function can be selected to be rather narrow, in order to fix the middle points first. After that, the width of the gaussian is increased, and the modules learn to represent their inputs more and more accurately, using more and more peripheral points of the sampling grid.

Still another method for enhancing the organizing power, a standard method for ensuring good organization in ordinary SOMs, is to make the neighborhood kernel dependent on time, $h_c^{(i)} = h_c^{(i)}(t)$. In the beginning of learning, the kernel should be very wide, effectively comprising half or even more of the map. When the width then gradually decreases to its final value, the map approximates its inputs more closely, while at the same time preserving the global ordering achieved in the beginning. In our simulations, we used a simple box-shaped kernel. The modules up to a certain radius from the winner were thereby updated equally ($h_c^{(i)}(t) = 1$), whereas the other modules were not updated at all during this learning step ($h_c^{(i)}(t) = 0$).

3.3 Summary: The ASSOM Learning Algorithm. For each learning episode $S(t)$, consisting of successive time instants $t_p \in S(t)$, do the following (a schematic implementation is sketched after the algorithm):

• Find the winner, indexed by $c$:
$$c = \arg\max_i \sum_{t_p \in S(t)} \|\hat{x}^{(i)}(t_p)\|^2 .$$
• For each sample $x(t_p)$, $t_p \in S(t)$:

1. Rotate the basis vectors of the modules:
$$b_h^{(i)}(t+1) = \left[ I + \lambda(t)\, h_c^{(i)}(t)\, \frac{x(t_p)\, x(t_p)^T}{\|\hat{x}^{(i)}(t_p)\| \, \|x(t_p)\|} \right] b_h^{(i)}(t).$$

2. Dissipate the components $b_{hj}^{(i)}$ of the basis vectors $b_h^{(i)}$:
$$b_{hj}^{\prime(i)} = \mathrm{sgn}(b_{hj}^{(i)})\, \max(0, |b_{hj}^{(i)}| - \varepsilon), \quad \text{where } \varepsilon = \varepsilon_h^{(i)}(t) = \delta\, |b_h^{(i)}(t) - b_h^{(i)}(t-1)|.$$

3. Orthonormalize the basis vectors of each module. The orthonormalization can also be performed less frequently, say, after every hundred steps.

Above, $\lambda(t)$ and $\delta$ are suitable small scalar parameters.
4 Experiments

We next show what kinds of invariant-feature filters emerge in the ASSOM learning process in different sensory environments, that is, when the input to the ASSOM is subjected to various transformations. The emerging filters reflect statistical characteristics of short sequences of input patterns, not waveforms as such. The filters will learn only such pattern components of short input sequences as satisfy certain linear constraints, such as lying on a linear subspace. Patterns that do not satisfy these constraints will be averaged out in the long run. The resulting basis vectors in fact represent kernels of transforms that produce the invariant outputs.

Before reporting the final results, we first define some details of the simulations. Then we demonstrate the enhancement of the organizing power due to the nonlinear dissipation effect introduced in section 3.2.

4.1 Details of the Simulations. In this section we summarize the kinds of input patterns used and how they were transformed.

4.1.1 Translated Speech Signal. In the first experiments, the input signal consisted of digitized time-domain speech waveforms from the TIMIT (1988) speech data set, which contains phrases spoken by a large number of U.S. speakers, sampled at 12.8 kHz. Before picking the input vectors, high frequencies in the signal were enhanced by taking differences of successive samples (high-pass filtering). The input vectors consisted of 64 successive samples of the speech waveform. Each learning episode began at a randomly chosen time instant and consisted of eight vectors, each displaced by a random amount (eight samples on the average) from the previous one.
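As an illustration, episode construction might look as follows in Python. The uniform distribution of the displacements and the per-window unit-length normalization are our assumptions; the paper states only the mean displacement and that all inputs were normalized.

```python
import numpy as np

def speech_episode(signal, dim=64, n_vec=8, mean_shift=8, rng=np.random):
    """Construct one training episode from a speech waveform (section 4.1.1).

    The signal is high-pass filtered by differencing successive samples;
    the episode consists of n_vec windows of dim samples, each displaced
    from the previous one by a random amount (mean_shift samples on the
    average; here drawn uniformly from 1..2*mean_shift-1, an assumption).
    """
    hp = np.diff(signal)                    # high-pass: successive differences
    t0 = rng.randint(0, len(hp) - dim - 2 * mean_shift * n_vec)
    episode, t = [], t0
    for _ in range(n_vec):
        x = hp[t:t + dim].astype(float)
        episode.append(x / (np.linalg.norm(x) + 1e-12))  # unit-length inputs
        t += rng.randint(1, 2 * mean_shift)              # random displacement
    return np.array(episode)
```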
Figure 2: A sample speech waveform. The locations of the eight input vectors (with starting times $t_1, \ldots, t_8$) belonging to an episode are identified by the line segments under the speech waveform.
Thus, the transformation group inherent in the data consisted of translations in time. A segment of the original signal and the eight input vectors forming an episode are shown in Figure 2.

4.1.2 Two-Dimensional Colored Noise. Earlier (Kohonen, 1995c) we demonstrated that two-dimensional filters can be formed in response to photographic images. Here we wanted to specify the images statistically. Thus, in the rest of the experiments reported here, two-dimensional patterns consisting of colored noise were used. The patterns were formed by low-pass filtering white noise with a second-order Butterworth filter whose cutoff frequency was at 60 percent of the Nyquist frequency. The image was sampled with a circular, uniformly spaced sampling grid consisting of 316 pixels.

In the experiments, the inputs were translated horizontally, rotated, or scaled. The center of both the rotation and the scaling operations was fixed at the center of the sampling grid. The signal subspaces spanned by the samples of an episode can be regarded as low dimensional only if the transformations between the samples are modest. The amount of translation was varied from 0 to 10 pixels, in equal steps with a 10 percent random displacement at each step.
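A sketch of the colored-noise pattern generation just described (before the circular sampling and the transformations) is given below; filtering the 2-D white noise separately along both image axes is our assumption, as the paper does not specify the 2-D filtering details.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def colored_noise_image(shape=(64, 64), cutoff=0.6, rng=np.random):
    """Colored-noise pattern as in section 4.1.2 (sketch): white noise
    low-pass filtered with a second-order Butterworth filter, cutoff at
    60 percent of the Nyquist frequency."""
    b, a = butter(2, cutoff)        # cutoff is relative to the Nyquist frequency
    img = rng.randn(*shape)
    img = filtfilt(b, a, img, axis=0)   # filter along rows
    img = filtfilt(b, a, img, axis=1)   # and along columns (our assumption)
    return img
```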
Figure 3: Samples of transformed colored noise patterns. (a) Translated inputs. (b) Rotated inputs. (c) Scaled inputs. Each horizontal row forms an episode.
The amount of rotation ranged from 0 to 1 radian, divided into equal steps with a 10 percent random rotation at each step. The amount of scaling used in an episode increased in equal steps from 1.0 to 1.5, with a 10 percent random variation applied at each step. The translated images were weighted by a vertical gaussian kernel, to force all templates to become vertically centered at the same location (see section 3.2). The other kinds of inputs, scaled and rotated, were not weighted.

The inputs were high-pass filtered in time by taking differences of successive input vectors, to enhance the changing parts of the input. High-pass filtering is necessary, for example, for scaled images, where the center of the scaling remains approximately constant; a well-functioning invariance filter would otherwise model predominantly this invariant area instead of the changes in the stimuli produced by scaling. Before filtering, the number of samples in each episode was six; taking the differences reduced it to five. Exemplary episodes of translated, rotated, and scaled colored noise are shown in Figure 3.

4.1.3 Methods. The ASSOM lattice consisted, unless otherwise stated, of 24 units organized as a one-dimensional array. Each ASSOM unit contained two basis vectors, $b_1^{(i)}$ and $b_2^{(i)}$, which were initialized with random values before learning. All inputs were normalized to unit length. The learning process was summarized in section 3.3. The neighborhood kernel was box shaped, with $h_c^{(i)} = 1$ in the neighborhood of the winner and $h_c^{(i)} = 0$ for the other units. The basis vectors were orthonormalized after each adaptation step. If the computing time is critical, the orthonormalization can be performed less frequently.
Figure 4: The learning parameters versus number of learning episodes. The learning-rate factor λ(t) was of the form 0.1 · T/(T + 99t), where T is the total number of episodes during learning. In T steps the radius of the box-shaped neighborhood decreased linearly from seven lattice spacings to one spacing. The gaussian function used for weighting the sampled input pattern changed according to the curve marked by FWHM(t), full width at half maximum, expressed in terms of sampling points. The FWHM increased linearly from 8 to 50 sampling points.
During the learning process, the parameters changed according to Figure 4.

4.2 Demonstration of the Enhancement of the Organizing Power.

4.2.1 Preliminary Experiment Without Dissipation. In the first experiment with the ASSOM algorithm, the input consisted of translated speech signals. After 2000 learning episodes, the ASSOM units represented different frequency components in an orderly fashion (see Figure 5). The two basis vectors in each unit, $b_1^{(i)}$ and $b_2^{(i)}$, became tuned to approximately the same frequency band and had a 90-degree phase shift with respect to each other. Each output of a unit (the squared projection of the input on the subspace of the unit) became selective to a certain frequency band and was invariant to translations in time, as long as the frequency content of the input did not change.

If the learning period was extended, it turned out that the organizing result was not stable. In continued learning, a gap appeared in the representation. Some units had the tendency to represent several, usually two, distinct frequency components instead of one compact frequency band, and some range of the input frequencies was left unrepresented.
Figure 5: Tentative results with translated speech waveforms as inputs. Notice that the diagrams are in two parts. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$).
This occurred especially at the minima of the frequency spectrum. Figure 6 shows the filters after a learning process consisting of 100,000 episodes. There is a clear defect in the organization: a wide gap at frequencies 2000 through 3500 Hz in the distribution of the dominant frequencies of the units (see Figure 6e). It is notable that the frequency response of unit number 18 (closest to the gap), defined as the output of the unit as a function of the input frequency, consists of two separate peaks corresponding to the borders of the frequency gap (see Figure 6d). The ability of the ASSOM filters to represent several frequencies at the same time was discussed in section 3.2.

4.2.2 Reducing Excessive Degrees of Freedom Using Dissipation. After the dissipation term discussed in section 3.2 was introduced, the frequency responses of all units obtained a single, relatively narrow passband (see Figure 7d). The value 0.02 was used for the dissipation constant $\delta$. The gap in the distribution of the dominant frequencies of the ASSOM units was thereby closed (see Figure 7e). The steepest slope in the distribution of filters corresponds to a local minimum in the frequency spectrum. If Figures 7a and 7b are inspected carefully, the effect of the nonlinearity due to dissipation can be seen as small shoulders in the basis vectors at small values. It may be interesting to note that an effective gaussian-shaped time window was determined by the wavelet shapes (see Figure 7c), and its width seemed to be optimized for each filter separately.
Figure 6: Results after a longer learning process: the tentative results turned out to be unstable in the long run. Notice that the diagrams a–d are in two parts. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$). (c) $a_j^{(i)} = (b_{1j}^{(i)})^2 + (b_{2j}^{(i)})^2$, where $b_1^{(i)} = [b_{1,1}^{(i)}, \ldots, b_{1,64}^{(i)}]^T$, etc. (d) The response of the units to different frequencies, $F_j^{(i)} = (f_{1j}^{(i)})^2 + (f_{2j}^{(i)})^2$, where $f_1^{(i)}$ and $f_2^{(i)}$ are the Fourier transforms of $b_1^{(i)}$ and $b_2^{(i)}$, respectively. (e) The dominant frequencies of the units. The bars indicate the FWHM of the frequency responses $F^{(i)}$.
Figure 7: The final results with speech waveforms, when the nonlinear dissipation effect was included. Notice that the diagrams a–d are in two parts. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$). (c) $a_j^{(i)} = (b_{1j}^{(i)})^2 + (b_{2j}^{(i)})^2$. (d) The response of the units to different frequencies, $F_j^{(i)} = (f_{1j}^{(i)})^2 + (f_{2j}^{(i)})^2$, where $f_1^{(i)}$ and $f_2^{(i)}$ are the Fourier transforms of $b_1^{(i)}$ and $b_2^{(i)}$, respectively. (e) The dominant frequencies of the units. The bars indicate the FWHM of the frequency responses $F^{(i)}$.
Figure 8: (a) Basis vectors $b_1^{(i)}$ and (b) basis vectors $b_2^{(i)}$ of an ASSOM that is invariant to horizontal translations. Notice that the diagrams are in two parts. During learning, the FWHM of the vertical gaussian used for weighting the inputs changed from 0.2 to 1.0.
4.3 Formation of Various Invariant Visual-Feature Filters. The environment, that is, the transformations that modify the inputs, determines what kinds of filters emerge in the learning process. In the previous section, where the input consisted of translated speech signals, the basis vectors developed into wavelet-like templates. In the following experiments, we show that different types of invariant visual-feature filters are learned when the input is translated horizontally, rotated, or scaled. Since the system model was the same throughout and the input samples in all of these experiments consisted of the same type of colored noise, the differences in the resulting filters are determined entirely by the characteristics of the transformation type. A total of T = 30,000 episodes, each consisting of five transformed sample patterns, were presented. The value of the dissipation constant $\delta$ was 0.0001.

If the inputs are translated randomly in the image plane, in both the horizontal and the vertical directions, basis vectors resembling two-dimensional Gabor functions will emerge (Kohonen, 1996). Here we show, for comparison, that if the translation is mainly restricted to one direction, say horizontal, then the filters become invariant to movements in that direction. Filters that emerged when the input was translated horizontally, and thereby weighted with a vertical gaussian to eliminate the asymmetries (see section 3.2), are shown in Figure 8. The basis vectors are sinusoidally modulated in the horizontal direction, and the two basis vectors in each unit have a 90-degree phase shift with respect to each other.
Figure 9: (a) Basis vectors $b_1^{(i)}$ and (b) basis vectors $b_2^{(i)}$ of the rotation-invariant ASSOM. Notice that the diagrams are in two parts.
In an environment where differently rotated input patterns are present, sinusoidally modulated basis vectors again emerge. This time the modulation occurs in the azimuthal direction (see Figure 9). The two basis vectors $b_1^{(i)}$ and $b_2^{(i)}$ have a 90-degree phase shift with respect to each other, whereby the responses of the units are invariant to rotation of the inputs.

The ASSOM subspaces can readily represent invariance classes of linear transformations, such as rotation and translation. Nonlinear transformations can also be represented approximately linearly, the approximation being better the smaller the amount of transformation. Such linear approximations to a nonlinear operation emerge in the ASSOM when the inputs are scaled with respect to the center of the sampling grid. The basis vectors acquire a sinusoidal modulation in the radial direction outward from the scaling center (see Figure 10). The phases of the two basis vectors in each unit again have a 90-degree shift with respect to each other, whereby the responses of the units are invariant to the changing scale of the inputs.

4.4 Competition on Different Transformations. If the input patterns are subjected to several transformation types, each of them active during different episodes, the ASSOM units will also compete on these different transformations. Because of the neighborhood interactions during learning, neighboring units will become invariant to similar transformations, but areas devoted to different transformations will also emerge. To demonstrate the formation of areas specializing in different transformation types, we transformed the input patterns using three different transformation types, namely horizontal translation, rotation, and scaling, during different episodes.
Figure 10: (a) Basis vectors $b_1^{(i)}$ and (b) basis vectors $b_2^{(i)}$ of the scale-invariant ASSOM. Notice that the diagrams are in two parts.
The transformations were defined precisely as in the previous section, except that no gaussian weighting was applied to the inputs. Vertical centering of the templates during horizontal translation was favored by varying the tilt of the pattern with respect to the sampling grid after each translation by a random amount (uniformly distributed between −0.5 and 0.5 radians). The ASSOM lattice consisted of 48 units, and the radius of the neighborhood kernel decreased from 14 units to 1 during learning. Otherwise the learning proceeded exactly as before.

After the learning process, three distinct areas can be discerned in the ASSOM (see Figure 11). The units in the first row have mainly become invariant to scaling, and the units in the last row mainly to rotation. The units in the middle of the ASSOM, especially units 25 to 30, are best described as being invariant to horizontal translations.

5 Discussion

In this article we demonstrated that invariant-feature filters, and areas specializing in different transformations, can emerge in the ASSOM as a result of Hebbian-type competitive learning. The learning rule was derived by minimizing an error function with Robbins-Monro stochastic approximation. The objective function describes the average expected error of a random sequence of input samples compared with the corresponding best "neural" representation (the subspace of the best-matching processing module). With the simple nonlinear modifications of the learning rule described in section 3.2, the invariant-feature filters were ordered robustly and reliably according to the similarity of the transformations.
Figure 11: Different areas of an ASSOM have become invariant to different transformations, when the input consisted of horizontally translated, rotated, and scaled patterns. (a) The first basis vectors of the ASSOM units ($b_1^{(i)}$). (b) The second basis vectors ($b_2^{(i)}$). Notice that the diagrams are in four parts.
The capability of the network to learn invariant features was demonstrated for both natural speech signals and broadband random images. Each processing module in the network can be made to become invariant to one transformation type and to decode a certain range of features invariantly under this transformation. Different competing modules specialize in representing different kinds of invariant features. The network can thus function as a learning preprocessing stage for invariant-feature extraction.

There were two basis vectors in the neural modules in our experiments.
The resulting two-dimensional subspaces then corresponded to those of basic wavelet filters. However, there is no restriction on the number of basis vectors in the algorithm, and subspaces of larger dimension can readily be used to represent more complex features.

An intriguing possibility, that the neurons in the ASSOM architecture might have biological counterparts, is suggested by the forms and operation of the filters in the input and output layers. The receptive fields of the simple cells in the mammalian primary visual cortex have been hypothesized to be describable by Gabor-type functions (Daugman, 1980, 1985; Marcelja, 1980; Jones & Palmer, 1987), and the ASSOM process also produces Gabor-type filters in the first-layer neurons. The output-layer neurons respond invariantly to moving and transforming targets and in this sense resemble the complex cells.

Acknowledgments

We thank Professor Erkki Oja for useful discussions.

References

Albert, A. E., & Gardner, L. A., Jr. (1967). Stochastic approximation and nonlinear regression. Cambridge, MA: MIT Press.
Buhmann, J., & Kühnel, H. (1993). Vector quantization with complexity costs. IEEE Trans. Information Theory 39:1133–1145.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. New York: Wiley.
Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. Information Theory 36:961–1005.
Daugman, J. (1980). Two-dimensional spectral analysis of cortical receptive field profiles. Vision Res. 20:847–856.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A 2:1160–1169.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Comp. 3:194–200.
Gabor, D. (1946). Theory of communication. J. IEE 93:429–457.
Gersho, A. (1979). Asymptotically optimal block quantization. IEEE Trans. Information Theory 25:373–380.
Gray, R. M. (1984, April). Vector quantization. IEEE ASSP Magazine, pp. 4–29.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. J. Educat. Psych. 24:498–520.
Jones, J., & Palmer, L. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol. 58:1233–1258.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern. 43:59–69.
Kohonen, T. (1989). Self-organization and associative memory (3rd ed.). Berlin: Springer-Verlag.
Kohonen, T. (1991). Self-organizing maps: Optimization approaches. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial Neural Networks (Vol. 2, pp. 981–990). Amsterdam: North-Holland.
Kohonen, T. (1995a). The adaptive-subspace SOM (ASSOM) and its use for the implementation of invariant feature detection. In F. Fogelman-Soulié & P. Gallinari (Eds.), Proc. ICANN'95, Int. Conf. on Artificial Neural Networks (Vol. 1, pp. 3–10). Paris: EC2 et Cie.
Kohonen, T. (1995b). Emergence of invariant-feature detectors in self-organization. In M. Palaniswami, Y. Attikiouzel, R. J. Marks II, D. Fogel, & T. Fukuda (Eds.), Computational Intelligence: A Dynamic System Perspective (pp. 17–31). New York: IEEE Press.
Kohonen, T. (1995c). Self-organizing maps. Berlin: Springer-Verlag.
Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace SOM. Biol. Cybern. 75:281–291.
Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer-Verlag.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Comp. 6:767–794.
Makhoul, J., Roucos, S., & Gish, H. (1985). Vector quantization in speech coding. Proc. IEEE 73:1551–1588.
Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. 70:1297–1300.
Oja, E. (1983). Subspace methods of pattern recognition. Letchworth, England: Research Studies Press.
Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks 5:927–935.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist. 22:400–407.
Rubner, J., & Tavan, P. (1989). A self-organizing network for principal-component analysis. Europhys. Lett. 10:693–698.
TIMIT. (1988). CD-ROM prototype version of the DARPA TIMIT acoustic-phonetic speech database.
Wallis, G., Rolls, E., & Földiák, P. (1993). Learning invariant responses to the natural transformations of objects. In Proc. IJCNN'93, Int. Joint Conf. on Neural Networks, Nagoya (pp. 1087–1090). Piscataway, NJ: IEEE Service Center.
Wasan, M. T. (1969). Stochastic approximation. Cambridge: Cambridge University Press.
Received April 19, 1996; accepted November 1, 1996.
Communicated by Helge Ritter

Self-Organizing Map with Dynamical Node Splitting: Application to Handwritten Digit Recognition*

Sung-Bae Cho
Department of Computer Science, Yonsei University, Seoul, 120–749 Korea
This article presents a simple yet elegant pattern recognizer based on a dynamic node-splitting scheme for the self-organizing map that can adapt its structure as well as its weights. The scheme makes use of a structure-adaptation capability to place the nodes of prototype vectors into the pattern space accurately, so as to make the decision boundaries as close to the class boundaries as possible. In order to show the performance of the proposed scheme, experiments with the unconstrained handwritten digit database of Concordia University in Canada were conducted. The proposed method, an incremental formation of feature maps, achieves a recognition rate of 96.05 percent. In view of the elegant simplicity of the approach, the reported performance is remarkable and can stand comparison with the best results reported in the literature on the same database.

1 Introduction

A wide variety of methods have been proposed to realize a perfect recognizer of handwritten digits by computer. Many systems have been developed, but more work is required to match human performance (Suen, Nadal, Legault, Mai, & Lam, 1992). Recently the emerging technology of neural networks has been used to design pattern recognizers that approach human ability. Among several models, the multilayer perceptron has been recognized as a powerful tool for pattern classification problems. Its strength lies in its discriminative power and its capability of learning and representing implicit knowledge, but there are several difficulties in solving real-world problems. One of the shortcomings is how to determine the size and structure of the network. To overcome this difficulty, several approaches based on structure adaptation of the networks have been proposed (Koikkalainen & Oja, 1990; Sanger, 1991; Fritzke, 1994; Li, Tang, & Fang, 1995).[1]

* An abstracted version of this article was presented at the 13th International Conference on Pattern Recognition, Vienna, 1996.
[1] In addition, many papers related to this topic have been presented at several international conferences. Representative of them are Fritzke (1994), for a general approach to structure adaptation of neural networks; Koikkalainen and Oja (1990) and Sanger (1991), for typical tree-structured neural networks; and Li et al. (1995), for another tree-structured neural network applied to a pattern recognition problem.
Neural Computation 9, 1345–1355 (1997)
© 1997 Massachusetts Institute of Technology
In this article we propose an efficient pattern recognizer based on a dynamic node-splitting scheme for the self-organizing map (SOM) that can adjust its structure as well as its weights during learning. The network utilizes the structure-adaptation capability to place the nodes of prototype vectors into the pattern space accurately, so as to make the decision boundaries as close to the class boundaries as possible. This approach shapes the general structure-adaptive SOM into a pattern recognizer by splitting a node representing more than one class into a submap (composed of four nodes). It also allows an incremental formation of feature maps in the real-world problem of digit recognition. We will show how a network is properly constructed to solve the problem of handwritten digit recognition.

2 Preprocessing

2.1 The Database. The handwritten digit database of Concordia University of Canada consists of 6000 unconstrained digits originally collected from dead-letter envelopes by the U.S. Postal Service at different locations in the United States. The digits of this database were digitized in bilevel on a 64 × 224 grid of 0.153 mm square elements, giving a resolution of approximately 166 pixels per inch (Suen, Nadal, Mai, Legault, & Lam, 1990). Among the data, 4000 digits were used for training and 2000 digits for testing. The representative samples in Figure 1, taken from the database, show that many different writing styles are apparent, as well as digits of different sizes and stroke widths.

2.2 Feature Extraction. Digits, whether handwritten or typed, are essentially line drawings: one-dimensional structures in a two-dimensional space. Thus, local detection of line segments is an adequate method for extracting features. For each location in the image, information about the presence of a line segment at a given direction is stored in a feature map (Knerr, Personnaz, & Dreyfus, 1992). In this article, Kirsch masks have been used for extracting directional features (Kim & Lee, 1994). Kirsch defined a nonlinear edge enhancement algorithm as follows (Pratt, 1978):

$$G(i, j) = \max\left\{ 1, \max_{k=0}^{7} \left[ \left| 5S_k - 3T_k \right| \right] \right\},$$
where

$$S_k = A_k + A_{k+1} + A_{k+2}, \qquad T_k = A_{k+3} + A_{k+4} + A_{k+5} + A_{k+6} + A_{k+7}.$$

Here, $G(i, j)$ is the gradient of pixel $(i, j)$, the subscripts of $A$ are evaluated modulo 8, and the $A_k$ $(k = 0, 1, \ldots, 7)$ are the eight neighbors of pixel $(i, j)$, defined as shown in Figure 2.
Figure 1: Sample data. (a) Training. (b) Test.
Figure 2: Definition of the eight neighbors $A_k$ $(k = 0, 1, \ldots, 7)$ of pixel $(i, j)$.
Since the data set was prepared by thorough preprocessing, each digit is scaled to fit in a 16 × 16 bounding box such that the aspect ratio of the image is preserved. Then feature vectors for the vertical, horizontal, left-diagonal, and right-diagonal directions are obtained from the scaled image (Kim & Lee, 1994):

$$G_V(i, j) = \max(|5S_2 - 3T_2|, |5S_6 - 3T_6|)$$
$$G_H(i, j) = \max(|5S_0 - 3T_0|, |5S_4 - 3T_4|)$$
$$G_L(i, j) = \max(|5S_3 - 3T_3|, |5S_7 - 3T_7|)$$
$$G_R(i, j) = \max(|5S_1 - 3T_1|, |5S_5 - 3T_5|)$$

The final step in extracting the features compresses each 16 × 16 directional vector into a 4 × 4 one with an averaging operator, which produces one value for each 2 × 2 block of pixels by summing the four values and dividing the sum by four. In addition to the directional features, 4 × 4 compressed images have been used as global features. As a result, the final features consist of five 4 × 4 maps: four 4 × 4 local directional features and one 4 × 4 global feature.
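A sketch of this feature extraction in Python follows; the zero padding at the image border and the single-stage 4 × 4-block average pooling (the 2 × 2 averaging described above, applied twice, yields the same 4 × 4 result) are our assumptions.

```python
import numpy as np

def kirsch_features(img16):
    """Directional feature maps from a 16x16 digit image (section 2.2 sketch).

    Returns four 4x4 maps (vertical, horizontal, left-/right-diagonal)
    plus the 4x4 compressed global image.
    """
    H, W = img16.shape
    G = {d: np.zeros((H, W)) for d in "VHLR"}
    # Offsets of the eight neighbors A_0..A_7, clockwise from the upper left
    # (see Figure 2).
    nbr = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
    pad = np.pad(img16, 1)                     # zero padding: our assumption
    for i in range(H):
        for j in range(W):
            A = [pad[i+1+di, j+1+dj] for di, dj in nbr]
            g = [abs(5 * sum(A[(k+m) % 8] for m in range(3))
                     - 3 * sum(A[(k+m) % 8] for m in range(3, 8)))
                 for k in range(8)]
            G["V"][i, j] = max(g[2], g[6])     # G_V = max(|5S2-3T2|,|5S6-3T6|)
            G["H"][i, j] = max(g[0], g[4])
            G["L"][i, j] = max(g[3], g[7])
            G["R"][i, j] = max(g[1], g[5])
    # Average-pool each 16x16 map (and the raw image) down to 4x4.
    pool = lambda m: m.reshape(4, 4, 4, 4).mean(axis=(1, 3))
    return [pool(G[d]) for d in "VHLR"] + [pool(img16.astype(float))]
```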
3 The Method

In this section, we present the dynamic node-splitting scheme, which can simultaneously determine a suitable number of nodes and the connection weights between input and output nodes in a SOM. The basic idea is simple:

1. Start with a basic neural network (in our case, a 4 × 4 map in which each node is fully connected to all input nodes).

2. Train the network with Kohonen's algorithm (Kohonen, 1990).

3. Calibrate the network using known input-output patterns to determine which nodes should be replaced with a submap of several nodes (in our case, a 2 × 2 map) and which nodes should be deleted.

4. Unless every node represents a unique class, go back to step 2.

Note that step 3 adjusts the structure of the map so that each node represents a unique label for the classification. In our scheme, the weights of a new node are initialized by interpolating the weights of neighboring nodes.

3.1 Basic Structure. The structure of the network is quite similar to Kohonen's SOM, except for the irregular connectivity in the map. Figure 3 shows an instance of the network in which each node represents a unique class. Each node is connected to all the input nodes with corresponding weights. (This is in fact the final network structure obtained for recognizing the handwritten digits in the simulation.) The initial map of the network consists of 4 × 4 nodes.

The weight vector of node $i$ is denoted by $w_i \in \mathbb{R}^n$. The simplest analytical measure is the Euclidean distance between $x$ and $w_i$. The minimum distance defines the winner $w_c$. If we define a neighborhood set $N_c$ around node $c$, at each learning step all the nodes within $N_c$ are updated, whereas nodes outside $N_c$ are left intact. This neighborhood is centered around the node for which the best match with input $x$ is found:

$$\|x - w_c\| = \min_i \{\|x - w_i\|\}.$$
Figure 3: The proposed neural network. The number in each circle is the class that the corresponding node represents.
The width or radius of $N_c$ can be time variable. For good global ordering, it is advantageous to let $N_c$ be very wide in the beginning and shrink monotonically with time (Kohonen, 1990). The updating process may read

$$w_i(t+1) = \begin{cases} w_i(t) + \alpha(t)\,[x(t) - w_i(t)] & \text{if } i \in N_c(t), \\ w_i(t) & \text{if } i \notin N_c(t), \end{cases}$$

where $\alpha(t)$ is a learning rate, $0 < \alpha(t) < 1$.
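As an illustration, one such update step might be written as follows; the explicit lattice-coordinate array for the (eventually irregular) map and the 80-dimensional flattened feature vector are our assumptions.

```python
import numpy as np

def som_step(W, pos, x, alpha, radius):
    """One Kohonen update (section 3.1 sketch).

    W      : (M, n) array of weight vectors w_i.
    pos    : (M, 2) array of node coordinates on the map lattice
             (an assumption; the paper's map becomes irregular after splits).
    x      : input vector (here, the five 4x4 feature maps flattened to 80-d).
    alpha  : learning rate alpha(t), 0 < alpha < 1.
    radius : current radius of the neighborhood set N_c.
    """
    c = int(np.argmin(np.linalg.norm(W - x, axis=1)))     # winner w_c
    in_Nc = np.linalg.norm(pos - pos[c], axis=1) <= radius
    W[in_Nc] += alpha * (x - W[in_Nc])                    # update only N_c
    return W, c
```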
3.2 Splitting Nodes. After a constant number of adaptation steps, a node representing more than one class is replaced with several nodes. (We used a submap of 2 × 2 nodes.) Obviously such a node lies in a region of the input vector space where many misclassifications occur. If input patterns from different classes are covered by the same local node and activate this node to about the same degree, their vectors of local node activations may be nearly identical. Figure 4 shows how the network structure changes as some nodes representing duplicated classes are replaced by several nodes with finer resolution.
Figure 4: Map configurations changed through learning. (a) Initial status. (b) Intermediate status. (c) Final status.
3.3 Removing Nodes. The previous section showed how to extend the network structure. A necessary consequence is that all the nodes are connected directly or indirectly to each other. However, a problem may occur if the pattern space we try to discriminate has some disconnected regions. A solution can be found by introducing the deletion of nodes from the structure. An obvious criterion for a node to be deleted is that it lies in an area of $\mathbb{R}^n$ where the probability density is zero (Fritzke, 1994). For this purpose, nodes that have been inactive longer than a specified length of time can be deleted. In the previous example, only one node is deleted in the final map (see Figure 4c).
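Putting sections 3.2 and 3.3 together, a calibration pass might look as follows. This is a minimal sketch under our own assumptions: the node dictionary layout, the idle counter, and the jittered child weights are illustrative (the paper initializes new weights by interpolating the weights of neighboring nodes).

```python
import numpy as np

def calibrate_and_adapt(nodes, max_idle):
    """One calibration pass (sections 3.2-3.3 sketch; illustrative only).

    nodes : list of dicts with keys 'w' (weight vector), 'pos' (lattice
            coordinates), 'classes' (set of classes won during calibration),
            and 'idle' (steps since the node last won).
    Returns the adapted node list and whether every node is now pure.
    """
    adapted, all_pure = [], True
    for node in nodes:
        if node['idle'] > max_idle:        # delete long-inactive nodes (3.3)
            continue
        if len(node['classes']) <= 1:      # node already represents a
            adapted.append(node)           # unique class: keep it
            continue
        all_pure = False                   # impure node: replace it by a
        for di in (0.0, 0.5):              # 2x2 submap (3.2); jittered
            for dj in (0.0, 0.5):          # child weights stand in for the
                child = {                  # paper's interpolation scheme
                    'w': node['w'] + 0.01 * np.random.randn(*node['w'].shape),
                    'pos': node['pos'] + np.array([di, dj]),
                    'classes': set(), 'idle': 0}
                adapted.append(child)
    return adapted, all_pure
```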
Table 1: Comparisons of the Proposed Method with Alternative Methods (%).

Methods                          Correct   Error   Reject   Training   Testing   Pixels per Inch
Lam & Suen, 1988                  93.10     2.95     3.95      4000      2000     166
Nadal & Suen, 1988                86.05     2.25    11.70      4000      2000     166
Legault & Suen, 1989              93.90     1.60     4.50      4000      2000     166
Krzyzak, Dai, & Suen, 1990        86.40     1.00    12.60      4000      2000     166
Krzyzak, Dai, & Suen, 1990        94.85     5.15     0.00      4000      2000     166
Mai & Suen, 1990                  92.95     2.15     4.90      4000      2000     166
Suen et al., 1990                 93.05     0.00     6.95      4000      2000     166
Le Cun et al., 1990               96.40     3.40     0.20      7291      2007     300
Martin & Pittman, 1990            96.00     4.00     0.00     35200      4000     300
Cohen, Hull, & Srihari, 1991      95.54     4.46     0.00        --      2711      --
Cohen et al., 1991                97.10     2.90     0.00        --      1762      --
Knerr et al., 1992                90.30     9.70     0.00      7200      1800      --
Lemarie, 1993                     97.97     2.03     0.00      8783      7394      --
Kim & Lee, 1994                   95.40     4.60     0.00      4000      2000     166
Kim & Lee, 1994                   95.85     4.15     0.00      4000      2000     166
Proposed method                   96.05     3.95     0.00      4000      2000     166

Note: Dashes indicate values not reported.
4 Experimental Results

After training the proposed network with 4000 handwritten digits of the database, the recognition rate on the 2000 test data was investigated. Table 1 shows the performance of the proposed method together with the results reported for several alternative methods. It also provides information about the size of the data sets used for training and testing, along with the scanning resolution in pixels per inch. Although some of the previous methods have achieved better recognition results, they used a highly tuned architecture or a much larger training data set than used here. The error rate of the proposed method is 3.95 percent; in view of the elegant simplicity of the approach, this performance is remarkable and can stand comparison with the best results reported in the literature.

Previous work (Le Cun et al., 1990) showed that good generalization can be obtained only by designing a network architecture that contains a certain amount of a priori knowledge about the problem. The basic design principle is to minimize the number of free parameters that must be determined by the learning algorithm, without overly reducing the computational power of the network. This principle increases the probability of correct generalization because it results in a specialized network architecture that has reduced entropy. The proposed method of dynamic node splitting provides automatic determination of an architecture that fits the problem at hand.

A thorough analysis of the recognition results reveals that most of the confusion makes sense. For example, class 0 has three instances of
misclassification (2, 6, and 8), all of which are neighbors of the correct node in the map produced by the proposed scheme in the simulation (see Figures 4c and 5). Figure 5 shows in detail how the topological orderings of the pattern classes are preserved in the final map. As can be seen, the nodes for the same or similar classes form clusters that are topologically close to each other. For example, class 9 consists of two clusters: one close to class 4 and the other close to class 7. There is strong evidence that the recognizer made by the proposed neural network preserves the topological ordering of the input patterns of the handwritten digits. This is a relative advantage over the conventional tree-structured approach to node splitting.

In order to improve the performance, we are attempting to incorporate the k-nearest-neighbor rule into the decision of the final class. Unfortunately, the results so far are not exceptionally reliable. Work in progress seeks to improve reliability by introducing rejection criteria into the decision process.

5 Concluding Remarks

This article proposes an elegant self-organizing neural network that can solve complex classification problems. The key advantage of the network is that, through its ability of structure adaptation, it automatically finds a network structure and size suitable for the classification of complex patterns. Experimental results with handwritten digits have revealed that the proposed network efficiently and accurately classifies real patterns with high variations.

Handwritten digit recognition is not a new subject, and many good approaches have been published. The overwhelming majority of these approaches, however, use classical pattern recognition methods or backpropagation networks. This article demonstrates that very competitive recognition results can be obtained with the self-organizing map if it is suitably combined with a dynamic node-splitting scheme. Research is continuing with more difficult tasks, such as handwritten Hangul (Korean script) recognition.

Acknowledgments

This work was supported in part by the offline character recognition consortium at KAIST (Korea Advanced Institute of Science and Technology). I thank Ari Milgram for his work on drafts of this article.

References

Cohen, E., Hull, J. J., & Srihari, S. N. (1991). Understanding handwritten text in a structured environment: Determining ZIP codes from addresses. Int. J. Pattern Recognition and Artificial Intelligence 5 (1&2):221–264.
Figure 5: Maps preserving the topological orderings in the pattern classes.
Fritzke, B. (1994). Growing cell structures: A self-organizing network for unsupervised and supervised learning. Neural Networks 7 (9):1441–1460.
Kim, Y. J., & Lee, S. W. (1994). Off-line recognition of unconstrained handwritten digits using multilayer backpropagation neural network combined with genetic algorithm. In Proc. 6th Workshop on Image Processing and Understanding (pp. 186–193). (In Korean.)
Knerr, S., Personnaz, L., & Dreyfus, G. (1992). Handwritten digit recognition by neural networks with single-layer training. IEEE Trans. on Neural Networks 3 (6):962–968.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE 78 (9):1464–1480.
Koikkalainen, P., & Oja, E. (1990). Self-organizing hierarchical feature maps. In Proc. Int. Joint Conf. Neural Networks (pp. 279–284). San Diego, CA.
Krzyzak, A., Dai, W., & Suen, C. Y. (1990). Unconstrained handwritten character classification using modified backpropagation model. In Proc. 1st Int. Workshop on Frontiers in Handwriting Recognition (pp. 155–166). Montreal, Canada.
Lam, L., & Suen, C. Y. (1988). Structural classification and relaxation matching of totally unconstrained handwritten ZIP-code numbers. Pattern Recognition 21 (1):19–31.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. Advances in Neural Information Processing Systems 2:396–404.
Legault, R., & Suen, C. Y. (1989). Contour tracing and parametric approximations for digitized patterns. In A. Krzyzak, T. Kasvand, & C. Y. Suen (Eds.), Computer Vision and Shape Recognition (pp. 225–240). Singapore: World Scientific Publishing.
Lemarie, B. (1993). Practical implementation of a radial basis function network for handwritten digit recognition. In Proc. 2nd Int. Conf. Document Analysis and Recognition (pp. 412–415). Tsukuba, Japan.
Li, T., Tang, Y. Y., & Fang, L. Y. (1995). A structure-parameter-adaptive (SPA) neural tree for the recognition of large character set. Pattern Recognition 28 (3):315–329.
Mai, T., & Suen, C. Y. (1990). A generalized knowledge-based system for the recognition of unconstrained handwritten numerals. IEEE Trans. on Systems, Man and Cybernetics 20 (4):835–848.
Martin, G. L., & Pittman, J. A. (1990). Recognizing hand-printed letters and digits. Advances in Neural Information Processing Systems 2:405–414.
Nadal, C., & Suen, C. Y. (1988). Recognition of totally unconstrained handwritten digit by decomposition and vectorization (Tech. Rep.). Montreal, Canada: Concordia University.
Pratt, W. K. (1978). Digital image processing. New York: Wiley.
Sanger, T. D. (1991). A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Trans. on Neural Networks 2 (2):285–293.
Suen, C. Y., Nadal, C., Legault, R., Mai, T. A., & Lam, L. (1992). Computer recognition of unconstrained handwritten numerals. Proceedings of the IEEE 80 (7):1162–1180.
Suen, C. Y., Nadal, C., Mai, T., Legault, R., & Lam, L. (1990). Recognition of handwritten numerals based on the concept of multiple experts. In Proc. 1st Int. Workshop on Frontiers in Handwriting Recognition (pp. 131–144). Montreal, Canada.
Received March 13, 1996; accepted October 1, 1996.
Communicated by Kim Plunkett

Information Controller to Maximize and Minimize Information

Ryotaro Kamimura
Information Science Laboratory, Tokai University, Kanagawa 259-12, Japan
We propose an information controller to maximize and simultaneously minimize information in neural networks. The controller can be used to improve generalization performance and to interpret internal representations explicitly. The apparently contradictory operation of simultaneous maximization and minimization is made possible by assuming two kinds of information: collective information and individual information, representing the collective activities of hidden units and the activities of individual input-hidden connections, respectively. By maximizing the collective information and minimizing the individual information, networks that are simple in terms of the number of connections and the number of hidden units can be generated. This method was applied to the inference of the maximum onset principle and the sonority principle of artificial languages. In these problems, it was shown that the individual information could be sufficiently minimized while the collective information was simultaneously maximized. Experimental results confirmed improved generalization performance through the suppression of overtraining. In addition, the internal representations created by information maximization and minimization could easily be interpreted.

1 Introduction

An increasing number of studies have been conducted to formulate neural learning from the information-theoretic point of view (e.g., Atick, 1992; Barlow, 1989; Barlow, Kaushal, & Mitchison, 1989; Bridle, MacKay, & Heading, 1992; Linsker, 1988, 1989a, 1989b). Recently, information-theoretic supervised learning methods have been developed to control the internal representations created by neural networks (Deco, Finnoff, & Zimmermann, 1993, 1995; Kamimura, 1992, 1994; Kamimura, Takagi, & Nakanishi, 1995; Kamimura & Nakanishi, 1995). In these methods, information is defined by the activity of the hidden units. Thus, the methods aim to control hidden unit activity patterns in an optimal manner. Depending on the type of problem, information is maximized or minimized. Information maximization methods have been used to interpret internal representations explicitly and simultaneously to reduce the number of necessary hidden units (Kamimura & Nakanishi, 1995). On the other hand, information minimization methods

Neural Computation 9, 1357–1380 (1997)
© 1997 Massachusetts Institute of Technology
have been used especially to improve generalization performance (Deco et al., 1995; Kamimura et al., 1995) and to speed up learning (Akiyama & Furuya, 1991). Thus, if it is possible to maximize and minimize information simultaneously, the information-theoretic methods can be expected to apply to a wide range of problems.

In this article, we unify the information maximization and minimization methods into one framework, with the aim of improving generalization performance and interpreting internal representations explicitly. However, it is apparently impossible to maximize and minimize simultaneously the information defined by the hidden unit activity. Our goal is instead to maximize and minimize information at two different levels: collective and individual. This means that information is maximized in a collective way and minimized for individual input-hidden connections. The seemingly contradictory proposition of simultaneous information maximization and minimization can be overcome by assuming the existence of these two levels of information control.

The collective information is defined by using the hidden unit activity, which has been used extensively in past information-theoretic approaches (Deco et al., 1995; Kamimura & Nakanishi, 1995). If this information is maximized, only one hidden unit is strongly activated, while all the other hidden units are approximately off. This effect is related to the reduction of network size in terms of hidden units and to the explicit interpretation of internal representations. On the other hand, the individual information is defined by the strength of each input-hidden connection. If the individual information is minimized, the input-hidden connections are pushed close to zero. This means that individual information minimization has the same effect as weight decay and weight elimination. Thus, individual information minimization can be expected to yield better generalization performance.

The information is controlled by an information controller located outside the neural network and used exclusively to control the information. By assuming the information controller, we can see clearly how the information, appropriately defined, can be maximized or minimized. In addition, the actual implementation of the information method is much easier with the concept of the information controller.

The collective and individual information are defined by using the concept of mutual information between input patterns and hidden units, as formulated in Deco et al. (1993, 1995). The introduction of the concept of mutual information makes it possible to interpret the meaning of the information itself and to see the collective and individual information from a more unified point of view. For example, the collective information can be considered the information contained in the hidden units; from this information, we can infer which input patterns are given to the input units.

In this study, we applied our methods to the acquisition of rules of an artificial language that are complex enough to test the performance of the methods. Experimental results confirmed that the collective information can
be maximized and the individual information simultaneously minimized. By maximizing the information, the number of necessary hidden units could be reduced, and the remaining hidden units could be explicitly interpreted. In addition, by minimizing the individual information, almost all the input-hidden connections were pushed toward zero, leading the networks to better generalization performance. The generalization performance was well superior to that of several traditional methods used to improve it.

This article is organized as follows. Section 2 outlines the actual meaning of the information maximization and minimization methods and how to unify them. In section 3, after briefly introducing the concept of mutual information, we formulate the learning that maximizes and minimizes information. In section 4, we apply the information controller to a problem of syllabification: the detection of the maximum onset principle. It will be shown that information maximization and minimization can generate explicit internal representations. In section 5, we apply the method to the inference of the maximum onset principle with the sonority principle. It will be shown that generalization performance can be improved significantly by information maximization and minimization.

2 Functions of the Information Controller

The goal of this study is to unify information maximization and minimization so as to improve generalization performance and interpret internal representations explicitly. This section outlines the functions of the information controller and explains how the controller achieves information maximization and minimization.

2.1 Information Maximization and Minimization. To explain the functions of the information controller, it is necessary to understand the basic properties of information maximization and minimization. When the information defined by the hidden units is maximized, only one hidden unit is strongly activated, while all the other hidden units are inactive. As shown in Figure 1b, multiple strongly negative input-hidden connections (the thick dotted lines) are generated by the process of information maximization. In this case, it is easy to interpret the roles of the hidden units and the meaning of the internal representations. However, improved generalization performance cannot be expected, especially for complex problems, because of the strongly negative connections and the overly explicit responses to input patterns.

Second, when the information defined for the hidden units is minimized, all the hidden units are equally activated. One fundamental characteristic of information minimization is that the majority of input-hidden connections tend to be close to zero, as shown in Figure 1a. This means that information minimization performs the same operation as weight decay and weight elimination. Thus, information minimization has been used especially to
1360
Ryotaro Kamimura
improve generalization performance. However, it is difficult to interpret internal representations because all the hidden units are equally activated. 2.2 Collective and Individual Information. Information maximization and minimization methods have been developed independently. However, if the information is maximized and simultaneously minimized, interpretation and generalization performance can be improved. A statement that the information is maximized and simultaneously minimized is actually a contradictory one. To dissolve the contradiction, it is necessary to understand the actual meaning of information minimization. Information minimization is approximately equivalent to weight decay and weight elimination. This means that as weights are pushed toward zero, information can be decreased. Thus, the contradictory statement can be restated as follows: If information for hidden units is maximized and simultaneously inputhidden connections can be decayed, we have the same effect of information maximization and information minimization. One of the possible ways to implement the idea of the unification is to introduce weight decay or weight elimination term in the framework of information maximization. Instead of using simple weight decay (Ishikawa & Yamamoto, 1994; Kroph & Hertz, 1991; Weigend, Rumelhart, & Huberman, 1992), we define individual information for each input-hidden connection in the framework of mutual information between input units and hidden units and minimize this individual information. In the formulation of the individual information, each input-hidden connection is treated separately and independently, and it is examined whether the corresponding hidden unit is fired or not by the input-hidden connection. If this information is minimized, each input unit behaves independent of the corresponding hidden unit. This means that the input unit is not related to the hidden unit. In addition, the probability of the firing of a hidden unit, given the firing of an input unit, can be approximated by the strength of the corresponding inputhidden connection. Thus, the actual meaning of the individual information minimization is the weight decay or the weight elimination. On the other hand, our previous definition of the information (Kamimura & Nakanishi, 1995) is called collective information, because all the inputhidden connections are treated collectively. This information is defined in the framework of the mutual information between input patterns and hidden units. Since the individual information is also defined in the framework of the mutual information, information maximization and minimization can be formulated in a more unified way. 2.3 Information Controller. Two kinds of information, collective information and individual information, are controlled by using an information controller. The information controller is devised to interpret the mechanism of information maximization and minimization more explicitly. As can be seen in Figure 1, the information controller is composed of two subcom-
Information Controller
1361
ponents: an individual information minimizer and a collective information maximizer. The collective information maximizer is used to increase the collective information as much as possible (see Figure 1b). The individual information minimizer is used to decrease the individual information as much as possible. By this minimization, the majority of connections are pushed toward zero. Eventually all the hidden units tend to be intermediately activated, as shown in Figure 1a. Figure 1c shows a case where the collective information maximizer and the individual information maximizer are simultaneously applied. In the figure, a hidden unit activity pattern is a pattern of the maximum information in which only one hidden unit is on, while all the other hidden units are off. However, the multiple strongly negative connections shown in Figure 1b to produce a maximum information state are replaced by extremely weak input-hidden connections, represented by thin dotted lines. Strongly negative connections are inhibited by the individual information minimization. This means that by the information controller, the information can be maximized and at the same time one of the most important properties of the information minimization, weight decay or weight elimination, can approximately be realized. Consequently, the information controller can generate much-simplified networks in terms of hidden units and input-hidden connections. 3 Formulation by Mutual Information We briefly introduce a concept of mutual and conditional mutual information. Then, by using the mutual information, we formulate an information controller that maximizes collective information and minimizes individual information. 3.1 Mutual and Conditional Mutual Information. We consider an information channel and define mutual and conditional mutual information used to define the collective and individual information (Abramson, 1963; Deco et al., 1995). Let us conceptually see a network behavior in terms of mutual information to interpret intuitively the mechanism of the mutual information. Consider a situation in which a probability of the occurrence of hidden units is determined by the occurrence of input patterns. Let B denote a set of hidden units B = {b1 , . . . , bM } considered to be a random variable. A probability of the occurrence of the jth hidden unit bj is supposed to be given by a probability p(bj ), and p(bj | as ) is a conditional probability, given the sth input pattern of a set of input patterns A = {a1 , . . . , aS }. Average uncertainty of hidden units B and input patterns A is represented by H(B) and H(A), respectively. The conditional uncertainty of B, given A, is represented in H(B | A). Then the mutual information between hidden units B and input
1362
Ryotaro Kamimura
Maximum Information
Minimum Information
(a)
(b)
Individual Information Minimizer
Information Controller
Collective Information Maximizer
Input-Hidden Connections Strongly Positive Strongly Negative Weak Controlled by Collective Information Maximizer (c)
Controlled by Individual Information Minimizer
Figure 1: An information controller used to maximize and simultaneously minimize information. The information controller is composed of an individual information minimizer and a collective information maximizer. The collective information maximizer strongly activates a few hidden units, and the individual information minimizer pushes the majority of connections toward zero. Thus, simplified networks can be obtained in terms of the number of hidden units and the number of connections.
patterns A can be defined by I(B; A) = H(B) − H(B | A) X =− p(bj ) log p(bj ) B
+
XX A
=
XX A
=
X A
p(as )p(bj | as ) log p(bj | as )
B
p(as )p(bj | as ) log
B
p(as )I(B; as ),
p(bj | as ) , p(bj ) (3.1)
Information Controller
1363
where I(B; as ) =
X B
=−
p(bj | as ) log
X
p(bj | as ) p(bj )
p(bj | as ) log p(bj ) +
B
X
p(bj | as ) log p(bj | as )
(3.2)
B
is the conditional mutual information, given the sth input pattern. This mutual information means the decrease of the uncertainty of the hidden units by knowing the occurrence of input patterns. Since the role of B and A can be interchangeable, that is, I(B; A) = I(A; B), the mutual information can be rewritten as I(A; B) = H(A) − H(A | B),
(3.3)
where H(A) denotes the uncertainty of input patterns and H(A | B) is the conditional uncertainty of the input patterns, given hidden units. This means that by knowing the occurrence of hidden units, the uncertainty about which input pattern is actually given is diminished. 3.2 Neural Networks to Be Controlled. Neural networks to be controlled are composed of input, hidden, and output units with the bias, as shown in Figure 2. The jth hidden unit receives the net input from input units and simultaneously from a collective information maximizer: ujs = xj +
L X
wjk ξks ,
(3.4)
k=0
where xj is an information maximizer from the collective information maximizer to the jth hidden unit, L is the number of input units, wjk is a connection from the kth input unit to the jth hidden unit, and ξks is the kth element of the sth input pattern. The jth hidden unit produces activity or output by a sigmoidal activation function vjs = f (ujs ) =
1 . 1 + exp(−ujs )
(3.5)
A net input into the ith output unit is hsi =
M X j=0
Wij vjs ,
(3.6)
1364
Ryotaro Kamimura
Bias
Bias- Hidden Connections
Bias-Output Connections Bias
Input Unit
w j0
Hidden Unit
Wi0 W ij
s
ξk Input- Hidden Connections
s
wjk
v sj
Individual Information Minimizer
Output Unit
HiddenOutput Connections
Oi
ζ
s
i
Target
x j Information Maximizers
Collective Information Maximizer
Information Controller Figure 2: A network architecture equipped with an information controller.
where M is the number of hidden units and Wij is a connection from the jth hidden unit to the ith output unit. An output from the ith output unit is Osi = f (hsi ).
(3.7)
For measuring errors between targets and outputs, we use a cross-entropy function (Kato, Tan, & Ejima, 1990; Solla, Levin, & Fleisher, 1988; Tan, Kato, & Ejima, 1990):
G=
" S N ½ X X
ζis log
s=1
i=1
ζis 1 − ζis + (1 − ζis ) log s Oi 1 − Osi
¾# ,
(3.8)
where Osi is an output from the ith output unit and ζis is a target for the ith output unit, N is the number of output units, and S is the number of input patterns. Rules for updating are obtained by differentiating the cross-
Information Controller
1365
entropy function with respect to hidden-output connections Wij −η
S X ∂G =η (ζis − Osi )vjs , ∂Wij s=1
(3.9)
where η is a learning parameter. 3.3 Collective Information Maximizer. A collective information maximizer is used to maximize the information contained in hidden units. For this purpose, we define the collective information in terms of the mutual information. Let us approximate a probability p(bj | as ) of the occurrence of the jth hidden unit, given the sth input unit, by a normalized output pjs of the jth hidden unit computed by pjs
vjs
= PM
s m=1 vm
,
(3.10)
where the summation is over all the hidden units. It is reasonable to suppose that at the initial stage, all the hidden units are activated randomly or uniformly and all the input patterns are also randomly given to networks. Thus, the probability p(bj ) of the activation of hidden units at the initial stage is equiprobable, that is, 1/M. The probability p(as ) of input patterns is also supposed to be equiprobable, namely, 1/S. Thus, the mutual information in equation 3.1 is rewritten as I(B; A) = −
M M S X X 1 1X 1 log + pjs log pjs M M S s=1 j=1 j=1
= log M +
M S X 1X ps log pjs , S s=1 j=1 j
(3.11)
where log M is a maximum entropy. Information maximizers are updated for increasing collective information. For obtaining update rules, we should differentiate the information function with respect to information maximizers xj , ∂I(B; A) ∂xj à ! S M X X s s s =β pm log pm pjs (1 − vjs ), log pj −
1xj = βS
s=1
where β is a parameter.
m=1
(3.12)
1366
Ryotaro Kamimura
Input Unit (C) k p
jk
Hidden Unit (D) j Fires
Fires
1-p
jk
Does not fire
Figure 3: An interpretation of an input-hidden connection for defining individual information.
3.4 Individual Information Minimizer. We should formulate an individual information minimizer, one of the major components of the information controller. Individual information represents information about the interrelation between input and hidden units. For representing this individual information by mutual information, we consider an output pjk from the jth hidden unit only with a connection from the kth input unit to the jth output unit, pjk = f (wjk ),
(3.13)
which is supposed to be a probability of the firing of the jth hidden unit, given the firing of the kth input unit, as shown in Figure 3. Since this probability is considered a probability, given the firing of the kth input unit, conditional mutual information is appropriate for measuring the information. In addition, it is reasonable to suppose that the probability of the firing of the jth hidden unit is one-half at the initial stage of the learning, because we have no knowledge on input units. Thus, the conditional mutual information for a pair of the kth input unit and the jth hidden unit is formulated as ¶ µ ¢ 1 ¡ 1 Ijk (D; f ires) = −pjk log − 1 − pjk log 1 − 2 2 + pjk log pjk + (1 − pjk ) log(1 − pjk ) = log 2 + pjk log pjk + (1 − pjk ) log(1 − pjk ).
(3.14)
If input-hidden connections are close to zero, this function is close to a minimum, a minimum information, meaning that it is impossible to estimate the firing of the kth hidden unit. If connections are larger, information is larger,
Information Controller
1367
and correlation between input and hidden units is stronger. Total individual information is the sum of all the individual individual information, I(D; f ires) =
L M X X
Ijk (D; fires),
(3.15)
j=1 k=0
because each connection is treated separately or independently. The individual information minimization directly controls input-hidden connections. By differentiating the individual information function and the cross-entropy cost function with respect to input-hidden connections wjk , we have rules for updating concerning input-hidden connections, ∂I(D; fires) ∂G −η ∂wjk ∂wjk = −µ wjk pjk (1 − pjk ) "( ) # S N X X s s s s s (ζi − Oi )Wij vj (1 − vj )ξk , +η
1wjk = −µ
s=1
(3.16)
i=1
where η and µ are parameters. Thus, rules for updating with respect to inputhidden connections are closely related to the weight decay method. Clearly, the individual information minimization corresponds to diminishing the strength of input-hidden connections. 4 Detection of Maximum Onset Principle In this section, we apply an information controller to the inference of the maximum onset principle, that is, a rule for the syllabification in an artificial language. It is shown that the individual information can be increased and simultaneously the collective information can be increased significantly. By individual information minimization and collective information maximization, the number of hidden units and the number of input-hidden connections can significantly be reduced. Finally, simplified or kernel networks can be obtained, whose internal representations show explicitly a mechanism of inference of the maximum onset principle. 4.1 Data and Network Architecture. The segmentation of words into syllables is sometimes very difficult. The maximum onset principle suggests one possibility of explicit syllabification (Nema, 1995). A syllable is composed of an onset and a rhyme. The maximum onset principle states that the number of consonants in the onset should be as large as possible under the condition that a consonant cluster is permitted or well formed at the beginning of words. Table 1 shows rules for the syllabification of an artificial language in the form of V1 C1 C2 V2 , where V and C are a vowel and consonant, respec-
1368
Ryotaro Kamimura
Table 1: Rules for Syllabification of an Artificial Language
Rule Number 1 2 3 4 5 6 7 8
Beginning of Words
Accent V1
V2
Well formed V´ 1 (short) V2 Well formed V´ 1 (long) V2 Well formed V1 V´ 2 V´ 2 Well formed V´ 1 V2 Ill formed V´ 1 V´ 2 Ill formed V1 V´ 2 Ill formed V´ 1 Well and ill formed V1 V2
Segmentation Types
Examples
(V1 C1 )(C2 V2 ) p´et.rol ´ (V1 )(C1 C2 V2 ) A.pril (V1 )(C1 C2 V2 ) p´a.trol (V1 )(C1 C2 V2 ) m´i.crob ` (V1 C1 )(C2 V2 ) f´ac.tory (V1 C1 )(C2 V2 ) um.p´ire (V1 C1 )(C2 V2 ) bon.b ´ on ´ Impossible No examples
tively. The maximum onset principle states that if a consonant cluster C1 C2 is permitted at the beginning of words, a string V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ), that is, the number of consonants in onsets should be maximum. Rules 2 through 7 show cases where the maximum onset principle can directly be applied. For example, rule 3 shows that if a consonant cluster C1 C2 is possible at the beginning of words and the second vowel has an accent, then a string V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ). Rule 5 shows that if a consonant cluster C1 C2 is not possible at the beginning of words and the first vowel has an accent, then a string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ). One possible exceptional case for the maximum onset principle is that if the first vowel is short, the segmentation has a form of (V1 C1 )(C2 V2 ). Rule 1 shows that if a consonant cluster C1 C2 is possible at the beginning of words, and the first vowel V1 has an accent, that is, V´ 1 , and the vowel is a short vowel, a string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ). For example, the word p´etrol should be decomposed into (p´et)(rol), because the vowel is short. Finally, rule 8 shows that at least one vowel should have an accent. In our language system, there are no words without accents. Figure 4 shows a network architecture used in experiments. There were 4 input units and 2 output units. The number of hidden units was increased from 10 to 20. In Figure 4, we can see that if the first vowel has an accent and is a long or a diphthong vowel, then V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ). 4.2 Collective Information Maximization and Individual Information Minization. In this section, it is shown that individual information minimization is not contradictory to collective information maximization. This means that the individual information can be minimized and simultaneously the collective information can significantly be increased.
Information Controller
1369
(a) Bias
Bias
Accent(V1 ) Long/Short(V1 ) Consonant (C C ) 1 2 Cluster Accent(V2 )
(V1 )(C1 C2 V2 ) Output Units
Input Units
Hidden Units
(b)
Figure 4: Network architectures for rule 2 of Table 1. If the first vowel has an accent and is a long or diphthong vowel, then V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ).
1370
Ryotaro Kamimura
1.4
β=0.000 β=0.001 β=0.003 β=0.007 β=0.015
β=0.000 β=0.001 β=0.003 β=0.007 β=0.015
0.45 0.4
1
Individual Information
Collective Information
1.2
0.8 0.6 0.4
0.35 0.3 0.25 0.2 0.15
0.2
0.1 0 0
3e-05
µ
6e-05
9e-05
0
3e-05
(a)
6e-05
9e-05
(b) 0.3
β=0.00 β=0.01 β=0.02 β=0.03 β=0.04
1.4
β=0.00 β=0.01 β=0.02 β=0.03 β=0.04
0.25
Individual Information
1.2
Collective Information
µ
1 0.8 0.6 0.4
0.2
0.15
0.1 0.2 0.05
0 0
1e-05
µ
(c)
2e-05
3e-05
0
1e-05
µ
2e-05
3e-05
(d)
Figure 5: Collective (a,c) and individual (b,d) information. (a,b) Using 10 hidden units. (c,d) Using 20 hidden units. For 10 hidden units, a parameter β was increased from 0 to 0.015. A parameter µ was increased from 0 to 9 × 10−5 . For 20 hidden units, a parameter β was increased from 0 to 0.04 and a parameter µ was increased from 0 to 5 × 10−5 . Note that a parameter η was always 0.01.
Figure 5 shows individual information and collective information as a function of the parameter µ. Two types of information were normalized so as to range between zero (minimum) and one (maximum). Figures 5a and b show the individual information and collective information with 10 hidden units. In Figure 5a, the collective information is increased from 0.115 (β = 0) to 0.923 (β = 0.015). From Figure 5b, we can see that the individual information can be significantly decreased as the parameter µ is increased from 0 to 9×10−5 . For example, when the parameter β is 0.015, the individual information is decreased from 0.276 (µ = 0) to 0.094 (µ = 9 × 10−5 ). Then the number of hidden units was increased from 10 to 20. As shown in Figure 5c, the collective information is increased from 0.078 (β = 0) to 0.981 (β = 0.04). The individual information is sufficiently decreased. For example, when the parameter β is 0.03, the individual information is decreased from 0.149 (µ = 0) to 0.062 (µ = 3 × 10−5 ). These results show that collective information can be increased and simultaneously individual information is decreased.
Information Controller
1371
4.3 Interpretation of Internal Representations. The information controller can generate very explicit internal representations. Figure 6 shows the strength of connections into 10 hidden units from 4 input units (wjk ), from the bias (wj0 ) and the information maximizer (xj ). Hidden units were ordered by the magnitude of the relevance of each hidden unit (Mozer & Smolensky, 1989). All the input-hidden connections except connections into the first two hidden units are almost zero. The information maximizers xj are all strongly negative for these cases. These negative information maximizers make eight hidden units (from the third to tenth hidden unit) inactive, that is, close to zero. Figure 7 shows obtained kernel networks. One of the most important characteristics of the first network is that a connection about the well formedness of a consonant cluster C1 C2 is overwhelmingly large: 26.48. This means that if a consonant cluster is permitted or well formed at the beginning of words, that is, the third input unit is on, the hidden unit has a high possibility of being strongly activated and the first and the second output units are strongly activated and inactive, respectively, meaning that a string V1 C1 C2 V2 should be decomposed into (V1 )(C1 C2 V2 ). If a consonant cluster is ill formed at the beginning of words, the string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ), because the hidden unit is inactive by the strong negative bias and the first and the second output units are inactive and strongly activated by the bias, representing 01. Thus, the first hidden unit justly represents the maximum onset principle. The second hidden unit in Figure 7b clearly shows an exceptional case where in our artificial language at least one vowel must have an accent. As can be seen in the figure, if both vowels have no accents, the hidden unit can be activated and output units are all inactive, that is, an impossible case. 5 Maximum Onset and Sonority Principle In this section, a problem to be solved by networks becomes much more complicated. A sonority principle regulating the well formedness of consonant clusters at the beginning of words is incorporated into data of the maximum onset principle. Thus, networks must infer the sonority principle in addition to the maximum onset principle. Experimental results confirmed that generalization performance by the information controller was much better than the performance by other methods. Finally, we note that by examining closely internal representations, explicit mechanism of the network’s behavior could be observed. 5.1 Well Formedness by the Sonority Principle. To examine whether more complicated problems can also be solved in our information controller and generalization performance can be improved, additional regularity called sonority is incorporated into the data of the maximum onset principle. In our artificial language, consonant clusters at the beginning of
1372
Ryotaro Kamimura
Strength of Input-Hidden Connections
30
Input Unit No.1 Input Unit No.2 Input Unit No.3 Input Unit No.4 Bias Information Maximizer
20 10 0 -10 -20 -30 -40 -50 -60 1
2
3
4
5 6 Hidden Unit
7
8
9
10
Figure 6: Strength of connections into 10 hidden units from four input units (wjk ), from the bias (wj0 ) and the information maximizer (xj ) by the information controller. The parameters β, µ, and η were 0.015, 0.0008, and 0.01. The hidden and individual information were 0.94 and 0.129 for the information controller.
words must strictly observe the sonority principle: that each consonant has its own sonority value, as shown in Table 2 (Ladefoged, 1993; Nema, 1994). The principle can be summarized as follows: Suppose that two consonants C1 C2 must be placed at the beginning of words. If the sonority of the precedent consonant C1 is smaller than that of the following consonant C2 , then the consonant cluster is well formed or permitted at the beginning of words. Otherwise, the consonant cluster is not permitted at the beginning of words, that is, it is ill formed. Figure 8 shows a sonority system with examples. As shown in Figure 8a, a consonant cluster /pr/ is well formed, because the sonority of /p/ (sonority = 1) is less than the sonority of /r/ (sonority = 7). On the other hand, a consonant cluster /rp/ is ill formed, because the sonority of /p/ is less than the sonority of /r/, as shown in Figure 8b. 5.2 Data Encoding and Network Architectures. Figure 9 shows a network architecture used in experiments. The number of input, hidden, and output units were 13, 20, and 2 units, respectively. The first three input units represent the existence of the accent for the first vowel V1 , the types of the vowel, that is, a short vowel or a long vowel (including a diphthong), and the existence of the accent for the second vowel V2 . The figure shows that the first vowel has an accent and is a short vowel and the second vowel has no accent. The remaining input units represent two consonants represented in a phonological representation (Plunkett, Marchman, & Knudsen, 1990). In the figure, a consonant cluster /pl/ is given to the network. The consonants /p/ and /l/ are represented in 01111 10110 in the phonological representa-
Information Controller
Accent
1373
Input Units
Output Units
(V1 ) 3.09 Long/Short (V1 ) 10.77 Consonant Cluster (C1 + C2)
Accent (V2 )
Hidden Unit
-37.83
-2.05 4.63
-60.88+22.07
26.48
Bias+ Information Maximizer
-4.18
2.13
13.82
(a)
Accent (V1 )
Input Units
-3.35 Long/Short (V1 )
Output Units Hidden Unit
0.68
-2.05 -0.69
0.11 1.63-0.95 Initial (C1 + C2)
Accent (V2 )
0.33
-5.30 Bias+ Information Maximizer
2.13
-3.08 (b)
Figure 7: Kernel networks obtained by the information controller. (a) A kernel network for the maximum onset principle. (b) A kernel network for an exceptional case in which both vowels have no accent. The parameters β, µ, and η were 0.015, 0.0008, and 0.01.
1374
Ryotaro Kamimura
Table 2: Sonority Hierarchy of an Artificial Language Order
Features
Examples
1 2 3 4 5 6 7 8
Voiceless stop Voiced stop Voiceless fricative Voiced fricative Nasal Lateral Retroflex Semivowel
[p, t, k] [b, d, g] [f, θ, s] [v, δ, z] [m, n, η] [l] [r] [y, w]
Source: Nema (1994).
Consonant Cluster ( C1 C 2 )
Well-formed
Sonority(C 1)
<
(a) /pr/
Sonority(C 2)
Ill-formed
Sonority(C 1)
> =
Sonority(C 2)
(b) /rp/
Figure 8: Well formedness of consonant clusters at the beginning of words.
tion. Since the sonority of /p/ is smaller than the sonority of /l/ according to Table 2, the consonant cluster is well formed. This situation corresponds to the rule 1 of Table 1. Since the first vowel V1 has an accent, and it is a short vowel, a string V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ). 5.3 Improving Generalization. By using the information controller, generalization performance can significantly be improved. The improved performance is due to the suppression of the overtraining. Table 3 shows cases for 100, 200, and 300 training patterns. Generalization performance by the information controller was compared with P performance 2 ), and by the standard method (β = 0 and µ = 0), weight decay ( jk wjk P 2 2 )) (Weigend et al., 1992). Generalization weight elimination ( jk wjk /(1 + wjk errors in the table were the best errors, that is, the smallest errors in terms of
Information Controller
1375
Bias
Bias (1) Accent(V1 ) (2) Long/Short(V1 ) (3) Accent(V2 ) (4) Voicing (5)
C1 (6)
Manner /p/
(7) (8) (9)
Place Output Units
Voicing
C 2 (10) Manner
(V1 C1 )(C 2 V 2)
/l/
(11)
(12) Place (13) Input Units
Hidden Units
Information Controller
Individual Information Minimizer
Collective Information Maximizer
Figure 9: A network architecture for rule 1 of Table 1. A well-formed string /pl/ is given to a network. Thus, the network must produce the segmentation that V1 C1 C2 V2 should be decomposed into (V1 C1 )(C2 V2 ).
the root-mean-squared (RMS) errors for each trial, because all the methods except the information controller frequently show overtraining. Then averages were over 7 best trials of 10 trials with different initial conditions. A parameter η was 0.05. A parameter λ for the weight decay and weight elimination was increased from 0 to 0.001 by 10−5 . A parameter β was increased from 0 to 0.007 by 0.0005. A parameter µ was increased from 0 to 0.0001 by 5 × 10−5 . Learning was considered to be finished when the absolute value between targets and outputs was lower than 0.1, that is, |ζis − Osi | < 0.1 for all the output units and all the input patterns or the number of epochs is larger than 300 epochs (×S updates). Even if the learning was forced to
1376
Ryotaro Kamimura
Table 3: Generalization Performance Comparison Generalization Errors RMSa Methods
Averages
Error Rates
Std.
Dev.b
Averages
Std. Dev.b
100 patterns Standard Weight decay Weight elimination Information controller
0.246 0.229 0.215 0.208
0.008 0.010 0.017 0.012
0.152 0.131 0.106 0.082
0.020 0.031 0.021 0.013
200 patterns Standard Weight decay Weight elimination Information controller
0.188 0.183 0.172 0.167
0.010 0.004 0.014 0.011
0.087 0.082 0.064 0.052
0.015 0.009 0.015 0.008
300 patterns Standard Weight decay Weight elimination Information controller
0.108 0.110 0.083 0.072
0.009 0.003 0.005 0.006
0.024 0.012 0.009 0.008
0.009 0.004 0.006 0.004
a RMS
= root mean squared error. devation.
b Standard
continue beyond this point, generalization errors were not improved for all the methods except the information controller. In addition to the errors in RMS, an error rate was computed for the intuitive interpretation of the results. The error rate was the number of incorrectly estimated new patterns divided by the total number of new patterns. For computing the error rate, outputs Osi were set to be one for Osi ≥ 0.5 and the outputs are considered to be zero for Osi < 0.5. Table 3 shows generalization errors for 100 input patterns. In terms of RMS, generalization errors are decreased from 0.246 (standard method) to 0.229 (weight decay), to 0.215 (weight elimination), and finally to 0.208 (information controller). In terms of error rates, the generalization errors are decreased from 0.152 to 0.131 (weight decay), to 0.106 (weight elimination), and finally to 0.082 (information controller). Regarding the generalization errors for 200 input patterns, in terms of the RMS, the best generalization error (0.167) is obtained by the information controller. In terms of the error rates, the best generalization errors (0.052) are also obtained by the information controller. When the number of input patterns was increased from 200 to 300 input patterns, the best generalization performance, 0.072, in terms of RMS and 0.008 in terms of the error rates is obtained when the information controller is used. Thus, these experimental results confirmed that in all
Information Controller
1377
0.2
0.35 Information Controller Weight Elimination Standard Method
Information Controller Weight Elimination Standard Method
0.3
0.16
Training Errors (RMS)
Generalization Errors (RMS)
0.18
0.14 0.12 0.1
0.25 0.2 0.15 0.1
0.08 0.05 0.06 0 0
50
100 150 200 Number of Epochs
250
300
0
50
(a) 0.9
0.35
300
Information Controller Weight Elimination Standard Method
0.3
0.7
Individual Information
Collective Information
250
(b) Information Controller Weight Elimination Standard Method
0.8
100 150 200 Number of Epochs
0.6 0.5 0.4 0.3
0.25 0.2 0.15 0.1
0.2 0.05
0.1 0
0 0
50
100 150 200 Number of Epochs
(c)
250
300
0
50
100 150 200 Number of Epochs
250
300
(d)
Figure 10: (a) Generalization errors. (b) Training errors. (c) Collective information. (d) Individual information. The number of hidden units was 20. The parameters β, µ, and η were 0.002, 0.0001, and 0.05, respectively. The parameter λ for the weight elimination was 0.00065.
cases, the generalization performance of the information controller is better than the other methods. Next we examine why the information controller can improve generalization performance. As the collective information is increased and the individual information is decreased, the number of hidden units and inputhidden connections is smaller, preventing networks from plunging into overtraining. Figure 10 shows generalization errors, training errors, collective information, and individual information as a function of the number of epochs, when the number of input patterns was 200 patterns. In Figures 10a and 10b, the standard method (dotted line) and weight elimination (broken line) show typical overtraining, meaning that generalization errors are increased (a), though training errors are constantly decreased (b). On the other hand, by the information controller (solid line), generalization errors are constantly decreased as training errors are decreased. In Figure 10c, the collective information by the information controller is increased to 0.775, while by weight elimination, the collective information continues to be small
1378
Ryotaro Kamimura
(0.085). Since the individual information (d) shows the same value for the information controller (0.121) and weight elimination (0.121), the collective information maximization, that is, the reduction of the necessary number of hidden units, is effective in preventing networks controlled by the information controller from plunging into overtraining. 6 Conclusion In this article, we have attempted to unify the information maximization and minimization method to improve generalization performance and provide an explicit interpretation of internal representation created by neural networks. The unification has been much facilitated by introducing an information controller located outside neural networks and used only to control the information contained in neural networks. The seemingly contradictory operation of information maximization and minimization has been solved by assuming the collective information maximizer and individual information minimizer. We have applied the methods to the acquisition of rules of artificial languages. Experimental results have shown that generalization performance can be improved significantly by the collective information maximization and simultaneously by the individual information minimization. In addition, internal representations created by the information controller have explicitly been interpreted. One problem in applying information control, especially to complex problems, should be noted. Collective information maximization and individual information minimization are mutually contradictory in terms of the control of hidden unit activity patterns. Some useless hidden units happen to be generated, for example, hidden units functioning as a bias. Although it is easy to detect these useless units, some techniques to remove them automatically in the process of information maximization and minimization will be beneficial to practical large-scale application of this method. Much attention has been paid to information maximization and minimization methods applied to neural computing. All these methods have been concerned with the exclusive maximization and minimization. However, in biological systems, it is not enough to assume that information is either exclusively maximized or minimized. The information is certainly controlled differently at different levels in the systems. Thus, the information controller for maximizing and minimizing information will have a great potential for clarifying the mechanism of information control in living systems. Acknowledgments I am deeply indebted to Shohachiro Nakanishi (Department of Electrical Engineering, Tokai University) and reviewers for their valuable comments on this article.
Information Controller
1379
References Abramson, N. (1963). Information Theory and Coding. New York: McGraw-Hill. Akiyama, Y., & Furuya, T. (1991). An extension of the back-propagation learning rule which performs entropy maximization as well as error minimization, (IEICE Tech. Rep. NC91-6). Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network 3:213–251. Barlow, H. B. (1989). Unsupervised learning. Neural Computation 1:295–311. Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy code. Neural Computation 1:412–423. Bridle, J., MacKay, D., & Heading, A. (1992). Unsupervised classifier, mutual information and phantom targets. In Neural Information Processing Systems (pp. 1096–1101). San Mateo, CA: Morgan Kaufmann. Deco, G., Finnof, W., & Zimmermann, H. G. (1993). Elimination of overtraining by a mutual information network. In Proceeding of the International Conference on Artificial Neural Networks (pp. 744–749). New York: Springer-Verlag. Deco, G., Finnof, W., & Zimmermann, H. G. (1995). Unsupervised mutual information criterion for elimination of overtraining in supervised multilayer networks. Neural Computation 7:86–107. Ishikawa, M., & Yamamoto, Y. (1994). Discovery of rules and generalization by structural learning of neural networks. Brain and Neural Networks 1(2):57–63. Kamimura, R. (1992). Entropy minimization to increase the selectivity: Selection and competition in neural networks. In Intelligent Engineering Systems Through Artificial Neural Networks (pp. 227–232). New York: American Society of Mechanical Engineers. Kamimura, R. (1994). Generation of internal representation by α-transformation. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 733–740). San Mateo, CA: Morgan Kaufmann. Kamimura, R., & Nakanishi, S. (1995). Hidden information maximization for feature detection and rule discovery. Network: Computation in Neural Systems 6:577–602. Kamimura, R., Takagi, T., & Nakanishi, S. (1995). Improving generalization performance by information minimization. IEICE Transactions on Information and Systems E78-D(2):163–173. Kato, Y., Tan, Y., & Ejima, T. (1990). A comparative study with feedforward PDP models for alphanumeric character recognition. Transaction of the Institute of Electronics, Information and Communication Engineers J73-D-II(8):1249–1254. Kroph, A., & Hertz, J. A. (1992). A simple weight decay can improve generalization. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (pp. 950–957). San Mateo, CA: Morgan Kaufmann. Ladefoged, P. (1993). A course in phonetics. New York: Harcourt Brace. Linsker, R. (1988). Self-organization in a perceptual network. Computer 21:105– 117.
1380
Ryotaro Kamimura
Linsker, R. (1989a). An application of the principle of maximum information preservation to linear system. In Neural Information Processing Systems (pp. 186–194). San Mateo, CA: Morgan Kaufmann. Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information. Neural Computation 1:402–411. Mozer, M. C., & Smolensky, P. (1989). Using relevance to reduce network size automatically. Connection Science 1(1):3–16. Nema, H. (1994). Phonotactics and sonority. Senshu Journal of Foreign Language and Education 23:65–94. Nema, H. (1995). Phonotactis and syllabification in English. Studies in the Humanities, 56:23–61. Plunkett, K., Marchman, V., & Knudsen, S. L. (1990). From rote learning to system building: Acquiring verb morphology in children and connectionist nets. In D. S. Touretzky, J. L. Elman, & G. E. Hinton (Eds.), Connectionist Models: Proceedings of the 1990 Summer School (pp. 201–219). San Mateo, CA: Morgan Kaufmann. Tan, Y., Kato, Y., & Ejima, T. (1990). Error functions to improve learning speed in PDP models: A comparative study with feedforward PDP models. Transaction of the Institute of Electronics, Information and Communication Engineers J73-DII(12):2022–2028. Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems 2:625–639. Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1992). Generalization by weight-elimination with application to forecasting. In Neural Information Processing Systems (pp. 950–957). San Mateo, CA: Morgan Kaufmann.
Received September 8, 1995; accepted October 10, 1996.
Communicated by Steven Nowlan
Controlling Hidden Layer Capacity Through Lateral Connections Kwabena Agyepong Ravi Kothari∗ Artificial Neural Systems Laboratory, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030, U.S.A.
We investigate the effects of including selected lateral interconnections in a feedforward neural network. In a network with one hidden layer consisting of m hidden neurons labeled 1, 2, . . . , m, hidden neuron j is connected fully to the inputs, the outputs, and hidden neuron j + 1. As a consequence of the lateral connections, each hidden neuron receives two error signals: one from the output layer and one through the lateral interconnection. We show that the use of these lateral interconnections among the hidden-layer neurons facilitates controlled assignment of role and specialization of the hidden-layer neurons. In particular, we show that as training progresses, hidden neurons become progressively specialized—starting from the fringes (i.e., lower and higher numbered hidden neurons, e.g., 1, 2, m − 1, m) and leaving the neurons in the center of the hidden layer (i.e., hidden-layer neurons numbered close to m/2) unspecialized or functionally identical. Consequently, the network behaves like network growing algorithms without the explicit need to add hidden units, and like soft weight sharing due to functionally identical neurons in the center of the hidden layer. Experimental results from one classification and one function approximation problems are presented to illustrate selective specialization of the hidden-layer neurons. In addition, the improved generalization that results from a decrease in the effective number of free parameters is illustrated through a simple function approximation example and with a real-world data set. Besides the reduction in the number of free parameters, the localization of weight sharing may also allow for a method that allows procedural determination for the number of hidden-layer neurons required for a given learning task. 1 Introduction The influence of the size of a neural network on its generalization performance is well established (Baum & Haussler, 1989). Given a lack of explicit
∗
Corresponding author
Neural Computation 9, 1381–1402 (1997)
c 1997 Massachusetts Institute of Technology °
1382
Kwabena Agyepong and Ravi Kothari
knowledge on the size of a network to use for a given application, several techniques have been proposed to allow for good generalization (see Bishop, 1995, for an excellent discussion). We briefly describe some of the more popular methods. Network growing and pruning algorithms offer a viable mechanism for arriving at a network with acceptable performance on the training and validation data sets. These algorithms treat the network structure as a parameter of optimization along with the weights. Pruning algorithms typically start with a large network and proceed by removing weights to which sensitivity of the error is minimal (e.g., Mozer & Smolensky, 1989; Karnin, 1990; LeCun, Denker, & Solla, 1990; Hassibi & Stork, 1993; Reed, 1993). In other pruning methods, such as weight decay, a regularization term is included in the error function to be minimized, to encourage smooth mappings (Plaut, Nowlan, & Hinton, 1986; Chauvin, 1989; Weigend, Rumelhart, & Huberman, 1991). A disadvantage of network pruning methods is that the composite effect of two or more weight removals is not computationally feasible. In addition, some form of retraining is often required after the pruning process. Network growing methods (e.g., Gallant, 1986; Nadal, 1989; Fahlman & Lebiere, 1991; Cios & Liu, 1992; Bose & Garga, 1993; Kwok & Yeung, 1995), on the other hand, typically start from a small network and add neurons and interconnections as more are needed. Typically these approaches add a neuron with full connectivity to existent neurons in the network (competition or pruning may be used to alter the connectivity of the added neuron to the existent network) when a suitably chosen measure (entropy, covariance, etc.) stops decreasing. Usually the newly added neuron is trained on the residual error using one of the following possibilities: (1) fix the rest of the network parameters (adjust the weights coming into and going out of the newly added neuron by training with the entire training data) or (2) adjust the weights coming into and going out of the newly added neuron by using the training pattern(s) giving the most error. In the first case the composite learning ability of the existent hidden neurons and the newly added one is not taken into account since the existent network weights are frozen. Some form of back-fitting (Kwok & Yeung, 1995) thus has to be used to fine-tune the existent hidden units, along with training the newly added hidden unit. In the second case, which is most often used with hidden neurons using the radial basis function, the number of hidden neurons is often influenced by the noise in the training patterns. An alternative to these methods is called soft weight sharing (Nowlan & Hinton, 1992; see also Rumelhart, McClelland, & PDP Research Group, 1986), wherein groups of weights are encouraged to have equal values, allowing for a reduction in the effective number of free parameters in the network. Such grouping of weights is achieved by considering a regularization term, formed by the negative logarithm of the likelihood of the weight values modeled as a mixture of normal distributions. Soft weight sharing has the advantage that it can train large networks with small amounts of
Controlling Hidden Layer Capacity
1383
training data. However, it requires some care in the proper initialization of the weights to ensure that good solutions are found. In this article we show that the use of (local unidirectional) lateral connections in a feedforward network also leads to soft weight sharing. In particular, we show that the use of lateral connections leads to a controlled assignment of role in the hidden-layer neurons beginning with the fringe neurons (i.e., lower- and higher-numbered hidden neurons, e.g., 1, 2, m − 1, m), leaving neurons in the center of the hidden layer (i.e., hidden-layer neurons numbered close to m/2) unspecialized. Consequently, the network behaves like network growing algorithms without the explicit need to add hidden units, and like soft weight sharing due to functionally identical neurons in the center of the hidden layer. Although a reduction in the number of free parameters is a consequence of this soft weight sharing, the localization of weight sharing that seems to be characteristic of feedforward networks with the proposed lateral connections may also allow for a method that allows procedural determination for the number of hidden-layer neurons required for a given learning task. In section 2, we introduce the network architecture and a gradient-descent-based learning procedure. In section 3, we show analytically that the use of lateral connections results in the specialization of the fringe hidden neurons with functionally identical neurons in the center of the hidden layer. In section 4, we show this effect experimentally through one classification and one function approximation problem. In addition, we also show the improved generalization that results from the proposed method. In section 5, we present the conclusions. 2 Lateral Connections in a Feedforward Network In this section, we present a feedforward network with selected lateral connections and summarize a gradient descent algorithm for training it. We first introduce the nomenclature that will be used: J {(ξ µ , ζ µ ): µ = 1, 2, . . . , p} n, m, o µ
hj
µ
zj
µ
si
µ
yi
Cost function to be minimized p Input-output training patterns Number of input, hidden, and output neurons, respectively Net input to hidden-layer neuron j on pattern µ Output of hidden-layer neuron j on pattern µ Net input to output-layer neuron i on pattern µ Output of output-layer neuron i on pattern µ
Controlling Hidden Layer Capacity
1385
The output of a hidden-layer neuron is a (nonlinear) function of its net input: µ
µ
zj = f (hj )
(2.2)
An output-layer neuron receives as net input, µ
si =
m X
µ
Wij zj ,
(2.3)
j=1
and produces an output: µ
µ
yi = f (si ).
(2.4)
We note that the output-layer neurons can also employ a linear activation function. The adjustment of the weights is done to minimize the sum of squared Pp µ µ error between the output yi and the desired output ζi , that is, J = µ=1 Jµ , where: Jµ =
o 1X µ µ 2 (ζi − yi ) . 2 i=1
(2.5)
To obtain the learning algorithm, we use gradient descent to minimize Jµ . For the hidden to the output-layer weights, we obtain, µ
µ
µ
µ
µ
1Wij = (ζi − yi ) f 0 (si )zj µ µ
= ηδi zj µ
µ
(2.6) µ
µ
where, δi = (ζi − yi ) f 0 (si ). The weight-update equations for the lateral weights are: µ
∂ Jµ ∂qj,j−1 ! "Ã o X µ 0 µ = ηq δi Wij f (hj )
1qj,j−1 = −ηq
i=1
+
à m o X X
β=j+1
µ
= ηq δj +
! µ
m X β=j+1
µ
δβ
µ
δi Wiβ f 0 (hβ )
i=1 β Y α=j+1
β Y α=j+1
µ qα,α−1 hj−1
µ
qα,α−1 hj−1 ,
(2.7)
1386
Kwabena Agyepong and Ravi Kothari
P µ µ µ where, δj = ( oi=1 δi Wij f 0 (hj )) is the normal backpropagation delta at µ the hidden-layer neurons. Thus, instead of backpropagating δi through the µ output–hidden weights, lateral backpropagation propagates δi through the output–hidden weights, as well as the lateral weights. The update equations for the input to the hidden-layer weights are similar to the above: µ
∂ Jµ ∂wjk "Ã o X
1wjk = −η =η
! µ µ δi Wij f 0 (hj )
i=1
+
à m o X X
β=j+1
µ
= η δj +
! µ
m X
µ
β=j+1
δβ
µ
δi Wiβ f 0 (hβ )
i=1 β Y α=j+1
β Y α=j+1
µ qα,α−1 ξk
µ
qα,α−1 ξk
(2.8)
These weight-update equations are identical to what one might obtain using backpropagation for the proposed architecture. For ease of reference, however, we term backpropagation for a feedforward network with lateral weights as lateral backpropagation (Kothari & Agyepong, 1996). We show next that in lateral backpropagation as training progresses, hidden neurons become progressively specialized—starting from the fringes and leaving the neurons in the center of the hidden layer unspecialized, that is, weight sharing occurs in the central portion of the hidden layer. 3 Controlled Specialization of Hidden-Layer Neurons Without loss of generality, we assume a single output unit—the network of size n×m×1 (n inputs, m hidden-layer neurons, and 1 output). Enumerating the net input received by the hidden-layer neuron (see equation 2.1), we obtain: µ
hj =
n X
µ
wjk ξk + qj,j−1
k=1
n X
n ¡ ¢X µ µ wj−1,k ξk + qj,j−1 qj−1,j−2 wj−2,k ξk
k=1
¡
+ · · · + qj,j−1 qj−1,j−2 qj−2,j−3 . . . q21
k=1 n ¢X
µ
w1k ξk
(3.1)
k=1
If all the lateral weights are initialized to be equal, that is, qj,j−1 = q∗ , and all the input to hidden weights is also initialized to be equal, that is, wjk = w∗ ,
Controlling Hidden Layer Capacity
1387
equation 3.1 simplifies to: µ
hj =
n X
µ
w∗ ξ k
k=1
à =
1 − q∗j 1 − q∗
³
1 + q∗ + q∗2 + . . . + q∗j−1
!
n X
´
µ
w∗ ξk .
(3.2)
k=1
From equation 3.2, one may observe that beyond some value t corresponding to a neuron position in the hidden layer, the net input to the hidden neurons will be the same for j > t, where 1 ≤ t ≤ m, when q∗ < 1. µ ˜ for t < j ≤ m. In other words, hj = h; µ The expression for zj is: µ
zj =
1 1+e
µ
−hj
1
= 1+e
1−q∗j − 1−q∗
Pn k=1
µ
w∗ ξk
.
(3.3)
In the forward pass, hidden neurons j ≤ t differentiate (i.e., have different outputs), while hidden neurons j > t behave similarly. We next show what happens during the weight updates. For the output µ µ µ µ to hidden weights, from equation 2.6, we have 1Wij = ηδi zj where δi = µ µ 0 µ µ µ 0 µ (ζi − yi ) f (si ). For a single output (ζi − yi ) f (si ) is a constant for all µ µ weights in each pass. Thus, 1Wij has the same profile as zj , which has the µ same profile as hj . For a single-output neuron (i = 1), enumerating the expression for the lateral weight updates (see equation 2.7), we get: h µ µ µ µ µ ∗m−j µ hj−1 + δi Wi,m−1 f 0 (hm−1 )q∗m−j−1 hj−1 1qj,j−1 = ηq δi Wim f 0 (hµ m )q i µ µ µ µ µ µ + · · · + δi Wi,j+1 f 0 (hj+1 )q∗ hj−1 + δi Wij f 0 (hj )hj−1 . (3.4) Rewriting the above, we get: µ
µ
h
µ
µ
µ
∗m−j hj−1 + Wi,m−1 f 0 (hm−1 )q∗m−j−1 hj−1 Wim f 0 (hµ m )q i µ µ µ µ + · · · + Wi,j+1 f 0 (hj+1 )q∗ hj−1 + Wij f 0 (hj )hj−1 . (3.5)
1qj,j−1 = ηq δi
Since from equation 3.2, the output of hidden neurons j > t is the same, µ µ = zj (1−zj ) = f˜, ∀t < j ≤ m. Further, if the hidden to output weights
µ f 0 (hj )
1388
Kwabena Agyepong and Ravi Kothari
˜ we obtain, are initialized to be equal (Wij = W), ¤ µ £ µ 1qm,m−1 = ηq δi Wim f 0 (hµ m ) hm−1 µ ˜ ˜˜ fh = ηq δi W
1qm−1,m−2
µ ∗¶ µ ˜ ˜˜ 1 − q = ηq δi W fh 1 − q∗ ¤ µ £ µ µ ∗ 0 µ = ηq δi Wim f 0 (hµ m )q + δi Wi,m−1 f (hm−1 ) hm−2 i h µ ˜ ˜ ∗ µ ˜ ˜ ˜ f q + δi W f h = ηq δi W
¢ µ ˜ ˜¡ ∗ f q + 1 h˜ = ηq δi W µ ∗2 ¶ µ ˜ ˜˜ 1 − q = ηq δi W f h 1 − q∗
(3.6)
(3.7)
.. .
h µ µ µ ∗m−t−1 + δi Wi,m−1 f 0 (hm−1 )q∗m−t−2 1qt+1,t = ηq δi Wim f 0 (hµ m )q ¤ µ µ µ µ µ + · · · + δi Wi,t+2 f 0 (ht+2 )q∗ + δi Wi,t+1 f 0 (ht+1 ) ht ³ ´ µ ˜ ˜ ∗m−t−1 µ f q = ηq δi W + q∗m−t−2 + · · · + q∗ + 1 ht µ ∗m−t ¶ µ ˜ ˜ µ 1−q = ηq δi W f ht 1 − q∗ h µ µ µ m−t + δi Wi,m−1 f 0 (hm−1 )q∗m−t−1 1qt,t−1 = ηq δi Wim f 0 (hµ m )q µ
µ
µ
(3.8)
µ
+ · · · + δi Wi,t+2 f 0 (ht+2 )q∗2 + δi Wi,t+1 f 0 (ht+1 )q∗ µ µ ¤ µ + δi Wi,t f 0 (ht ) ht−1 ³ ´ µ ˜ ˜ ∗m−t µ f q = ηq δi W + q∗m−t−1 + · · · + q∗2 + q∗ ht−1 µ
µ
µ
+ηδi Wit f 0 (ht )ht−1 µ ∗ ¶ q − q∗m−t+1 µ ˜ ˜ µ µ µ µ f ht−1 = ηq δi W + ηq δi Wit f 0 (ht )q∗ ht−1 1 − q∗ .. .
h µ µ µ ∗m−2 + δi Wi,m−1 f 0 (hm−1 )q∗m−3 1q21 = ηq δi Wim f 0 (hµ m )q µ
µ
µ
µ
+ · · · + δi Wi,t+1 f 0 (ht+1 )q∗t−1 + δi Wi,t f 0 (ht )q∗t−2 µ
µ
µ
µ
+δi Wi,t−1 f 0 (ht−1 )q∗t−3 + · · · + δi Wi,3 f 0 (h3 )q∗ µ µ µ¤ + δi Wi,2 f 0 (h2 ))h1 ³ ´ µ ˜ ˜ ∗m−2 µ f q + q∗m−3 + · · · + q∗t−1 h1 = ηq δi W h µ µ µ +ηδi Wit f 0 (ht )q∗t−2 + Wi,t−1 f 0 (ht−3 )q∗t−1 + · · ·
(3.9)
Controlling Hidden Layer Capacity µ µ ¤ µ + Wi3 f 0 (h3 )q∗ + Wi2 f 0 (h2 ) h1 µ ∗t−1 ¶ − q∗m−1 µ ˜ ˜ µ q = ηq δi W f h1 1 − q∗ h µ µ µ +ηδi Wit f 0 (ht )q∗t + Wi,t−1 f 0 (ht−1 )q∗t−1 + · · · + µ µ µ µ ¤ µ + δi Wi3 f 0 (h3 )q∗ + δi Wi2 f 0 (h2 ) h1
1389
(3.10)
Summarizing, we have the approximate proportionalities, ¡ ¢ 1qm,m−1 ∝ 0 1 − q∗ ³ ´ 1qm−1,m−2 ∝ 0 1 − q∗2 .. .
³ ´ 1qT,T−1 ∝ 0 1 − q∗m−T+1 .. .
¡ ¢ 1qt+1,t ∝ 0 1 − q∗m−t ¡ ¢ 1qt,t−1 ∝ 0 1 − q∗m−t q∗ .. .
¢ ¡ 1q21 ∝ 0 1 − q∗m−t q∗t−1
(3.11)
µ ˜ ˜ µ f hm /1 − q∗, and T > t corresponds to some neuron posiwhere 0 is ηδi W tion in the hidden layer. From equation 3.11, three categories clearly emerge. From hidden neuron 1 through t, the term inside the parentheses approaches 1, and the behavior of these neurons is dominated by the multiplicative terms q∗t−1 , . . . , q∗ , and differentiation occurs. From hidden neuron T through m, the expression approaches 1 gradually, due to (1 − q∗ ), . . . , (1 − q∗m−T−1 ). Consequently, differentiation again occurs. From hidden neuron t + 1 through T − 1, the updates have approached 1, since q < 1. Consequently, these neurons behave similarly (as in a herd). The input to hidden weights update provides similar results as far as the profile (behavior of the weights) in the update is concerned. The trend is noticed for each feature of the input vector, in that the hidden input weights from the kth feature position have a similar profile as the lateral weights. In detail, then, the hidden to input weights are similarly,
£ µ ¤ µ 1wmk = η δi Wim f 0 (hµ m ) ξk µ ∗¶ µ ˜ ˜ µ 1−q = ηδi W f ξk 1 − q∗ £ µ ¤ µ µ µ 0 µ ∗ 1wm−1,k = η δi Wim f (hm )q + δi Wi,m−1 f 0 (hm−1 ) ξk
(3.12)
1390
Kwabena Agyepong and Ravi Kothari
i h µ ˜ ˜ ∗ µ ˜ ˜ µ f q + δi W f ξk = η δi W µ ∗2 ¶ µ ˜ ˜ µ 1−q f ξk = ηδi W 1 − q∗
(3.13)
.. .
h µ µ µ ∗m−t−1 + δi Wi,m−1 f 0 (hm−1 )q∗m−t−2 1wt+1,k = η δi Wim f 0 (hµ m )q ¤ µ µ µ µ µ + · · · + δi Wi,t+2 f 0 (ht+2 )q∗ + δi Wi,t+1 f 0 (ht+1 ) ξk ´ ³ µ ˜ ˜ ∗m−t−1 µ = ηδi W + q∗m−t−2 + · · · + q∗ + 1 ξk f q µ ∗m−t ¶ µ ˜ ˜ µ 1−q f ξk = ηδi W 1 − q∗ h µ µ µ ∗m−t + δi Wi,m−1 f 0 (hm−1 )q∗m−t−1 1wtk = η δi Wim f 0 (hµ m )q µ
µ
µ
(3.14)
µ
+ · · · + δi Wi,t+2 f 0 (ht+2 )q∗2 + δi Wi,t+1 f 0 (ht+1 )q∗ µ µ ¤ µ + δi Wit f 0 (ht ) ξk ³ ´ µ ˜ ˜ ∗m−t µ f q = ηδi W + q∗m−t−1 + · · · + q∗2 + q∗ ξk µ
µ
+ηδi Wit ft0 ξk µ ∗ ∗m−t+1 ¶ µ ˜ ˜ µ q −q µ = ηδi W f ξk + ηδWit ft0 q∗ ξk 1 − q∗
(3.15)
.. .
h µ µ µ ∗m−2 + δi Wi,m−1 f 0 (hm−1 )q∗m−3 1w2k = η δi Wim f 0 (hµ m )q µ
µ
µ
µ
µ
µ
+ · · · + δi Wi,t+1 f 0 (ht+1 )q∗t−1 + δi Wi,t f 0 (ht )q∗t−2 µ
1w1k
µ
+δi Wi,t−1 f 0 (ht−1 )q∗t−3 + · · · + δi Wi,3 f 0 (h3 )q∗ µ µ ¤ µ + δi Wi2 f 0 (h2 )) ξk ³ ´ µ ˜ ˜ ∗m−2 µ µ£ µ f q = ηδi W + q∗m−3 + · · · + q∗t−1 ξk + ηδi Wit f(0 ht )q∗t i µ µ µ µ + Wi,t−1 f 0 ht−1 q∗t−1 + · · · + Wi3 f 0 (h3 )q∗ + Wi2 f 0 (h2 ) ξk µ ∗t−1 ¶ − q∗m−1 µ ˜ ˜ µ q = ηδi W f ξk 1 − q∗ h µ µ µ +ηδi Wit f(0 ht )q∗t + Wi,t−1 f 0 ht−1 q∗t−1 + · · · + µ µ ¤ µ (3.16) + Wi3 f 0 (h3 )q∗ + Wi2 f 0 (h2 ) ξk h µ µ 0 µ ∗m−1 0 µ ∗m−2 = η δi Wim f (hm )q + δi Wi,m−1 f (hm−1 )q µ
µ
µ
+ · · · + δi Wi,t+1 f 0 (ht+1 )q∗t + δWit f 0 (ht )q∗t−1 µ
µ
+δi Wi,t−1 f 0 (ht−1 )q∗t−2
Controlling Hidden Layer Capacity µ µ µ ¤ µ + · · · + δWi,2 f 0 (h2 )q∗ + δi Wi,1 f 0 (h1 ) ξk ³ ´ µ ˜ ˜ ∗m−1 µ f q = ηδi W + q∗m−2 + · · · + q∗t ξk h µ µ µ +ηδi Wit f 0 (ht )q∗t−1 + · · · + Wi,t−1 f 0 (ht−1 )q∗t−2 ¤ µ µ µ + Wi2 f 0 (h2 )q∗ + Wi,1 f 0 (h1 ) ξk µ ∗t ∗m ¶ µ ˜ ˜ µ q −q f ξk = ηδi W 1 − q∗ h µ µ +ηδi Wit f 0 (ht )q∗t−1 + · · · + Wi,t−1 f 0 h( t − 1µ )q∗t−2 µ µ ¤ µ +Wi2 f 0 (h2 )q∗ + Wi,1 f 0 (h1 ) ξk
1391
(3.17)
The correspondence in the form of the lateral weight updates (see equations 3.6 through 3.10) to the hidden to input weight updates (see equations 3.12 through 3.17) enables us to make an inference on the form of the profile. For the kth input, the update of all weight to the hidden neurons will preserve the activation profiles discussed above. We have shown how three distinct regions result in the profile of the lateral weights after the first update. Subsequent passes (directly driven by the output error size) serve to reinforce and expand the regions (except the herd); that is, the total number of hidden neurons added to the differentiated regions tends to increase. The forward pass after the first update produces weighted sums and activations of the hidden neurons that maintain the three regions. This is because neurons outside the herd are dependent on the input to hidden weights and the lateral weights, which have been shown to be differentiated. Thus, the parameters ensure a preservation of the regions (hidden neurons 1 through t, the herd t + 1 through T − 1, and T through m). During the weight updates, the differentiation in weights continues to be reinforced, resulting in an increase in number of the neurons that move away from the herd. This is explained by the fact that the weighted sum between adjacent hidden neurons is multiplied by a lateral weight. As these weights move away from the herd, the effect on the adjacent neuron becomes more pronounced. Thus, the neurons begin to move away from the herd. The hidden (T to m) to output neuron weights are not differentiated on the first update. After the first update, these weights start differentiating due to the differentiation of the lateral weights and also the hidden to input weights. 4 Simulations We present four simulations to illustrate various aspects of the proposed paradigm. In support of the analysis in the previous section, we present two simulations: a classification problem and a function approximation
1392
Kwabena Agyepong and Ravi Kothari
problem. The goal of these two simulations is to illustrate the selective specialization of the hidden-layer neurons from the fringes. Due to the effective reduction in the free parameters as a consequence of functionally identical hidden-layer neurons, one might anticipate better generalization. To illustrate this, we performed two additional simulations. The first of these latter two is a simple problem that clearly indicates better generalization while also allowing for easy visualization of the result. The second is based on the yearly sunspot data set. We also present comparative performance with backpropagation applied to a standard feedforward network (with no lateral connections). To make the comparisons meaningful, for the former two simulations we trained lateral backpropagation and backpropagation for the same number of epochs. Since the latter two simulations are intended to demonstrate generalization performance, we trained lateral backpropagation and backpropagation until both achieved the same error. 4.1 Simulation 1. Through this two-class classification problem, we show the selective differentiation of the hidden-layer neurons, as indicated by the analysis of the previous section. The decision boundary separating the two classes is specified by: −x + 30 f (x) = 0.2x2 − 6x + 50 x
, x ∈ [7.5, 5] , x ∈ [5, 25] , x ∈ [25, 35]
(4.1)
A network of size 2 × 15 × 1 was used (2 inputs, a hidden layer with 15 neurons, and 1 output). The hidden layer had 14 lateral weights. All initial network weights, excluding lateral weights, were initialized to 0.1. Lateral weights were initialized to 0.01. A learning rate of η = 0.1, ηq = 0.1 was used. All neurons used had the sigmoidal activation function. A total of 288 training patterns were used, and at the end of 1000 epochs, the network had learned to classify 92 percent of the patterns correctly. The desired classification is shown in Figure 2a, and the actual classification after 0, 300, 650, and 1000 epochs (see Figures 2b–e). The values of the network parameters at the end of these epochs are shown in Figure 3. In the first row, the values of the biases of the hidden neurons are shown at 0, 300, 650, and 1000 epochs. In the second row, the values of weights from the first input to the hidden neurons (wj1 ) are shown at 0, 300, 650, and 1000 epochs. In the third row, the values of weights from the second input to the hidden neurons (wj2 ) are shown at 0, 300, 650, and 1000 epochs. In the fourth row, the values of weights from the hidden to the output neurons (W1j ) are shown at 0, 300, 650, and 1000 epochs. In the fifth row, the values of lateral weights (qj,j−1 ) are shown at 0, 300, 650, and 1000 epochs. The final weights clearly show a herd located in the center of the network. Only 5 of the total available 15 hidden neurons were used differently from the herd.
Controlling Hidden Layer Capacity
1393
Figure 2: (a) Desired classification and the actual classification after (b) 0, (c) 300, (d) 650, and (e) 1000 epochs.
For comparison, we also present the results from running this same simulation with backpropagation on a purely feedforward network (with no lateral connections). We ran the network in two configurations: one with 15 hidden-layer neurons (the same as that used for lateral backpropagation) and the second with 5 hidden-layer neurons (the effective, i.e., differentiated, number of hidden neurons as determined above). We used an identical learning rate of η = 0.1, and trained the network in each case for 1000 epochs (the same as above). Since having equal weights results in backpropagation not learning at all, we randomly initialized the weights to lie between −0.5 and +0.5. Figure 4a shows the results of training backpropagation with 15 hidden-layer neurons, while Figure 4b shows the result of training with 5 hidden-layer neurons. The performance is roughly equivalent in both cases and worse than what was obtained with lateral backpropagation. 4.2 Simulation 2. In this simulation, we again show the selective differentiation of the hidden-layer neurons, as indicated by the analysis of the previous section; however, here we illustrate it using a function approxi-
1394
Kwabena Agyepong and Ravi Kothari
Figure 3: (a) Values of the bias, (b) weights from the first input, (c) weights from the second input, (d) weights from the hidden-layer neurons to the output, and (e) lateral weights. All are shown after 0, 300, 650, and 1000 epochs, respectively, for the classification problem of Figure 2a. Note the differentiation by the outermost hidden-layer neurons, while the hidden neurons in the center remain undifferentiated.
Figure 4: Actual classification for the classification problem of Figure 2a with a purely feedforward network, with (a) 15 and (b) 5 hidden-layer neurons.
Controlling Hidden Layer Capacity
1395
mation problem. The simulation is approximation of a decaying absolute valued sinusoidal function; 61 training points were used. A network of size 1 × 20 × 1 was used (1 input, a hidden layer with 20 neurons, and 1 output). The hidden layer had 19 lateral weights. All initial network weights excluding lateral weights were initialized to 0.1. Lateral weights were initialized to 0.01. A learning rate of η = 0.1, ηq = 0.1 was used. All neurons used had the sigmoidal activation function. The desired and actual approximation at 0, 6000, 9000, and 12,000 epochs are shown in Figure 5. In each of the plots, the function to be approximated is shown by solid lines, and the network approximation is shown by broken lines. The values of the network parameters at the end of these epochs are shown in Figure 6. In the first row, the values of the biases of the hidden neurons are shown at 0, 6000, 9000, and 12,000 epochs. In the second row, the values of weights from the input to the hidden neurons (wj1 ) are shown at 0, 6000, 9000, and 12,000 epochs. In the third row, the values of weights from the hidden to the output neurons (W1j ) are shown at 0, 6000, 9000, and 12,000 epochs. In the fourth row, the values of lateral weights (qj,j−1 ) are shown at 0, 6000, 9000, and 12,000 epochs. Notice again how the final weights clearly show a herd located in the center of the network. Theoretically, two sigmoids should be able to approximate each mode (hump) in this function. Only 9 of the total available 20 hidden neurons were used differently from the herd. Exact classification for the first simulation and extremely good approximation for this simulation were obtained when the network was trained beyond the epochs noted. The figures are derived from arbitrarily chosen points in the training cycle to show selective specialization of the hiddenlayer neurons. However, the observed characteristics replay constantly. For comparison, we also present the results from running the same simulation with backpropagation on a purely feedforward network (with no lateral connections). We ran the network in two configurations: one with 20 hidden-layer neurons (the same as that used for lateral backpropagation) and the second with 9 hidden-layer neurons (the effective, i.e. differentiated, number of hidden neurons as determined above). We used an identical learning rate of η = 0.1 and trained the network in each case for 12,000 epochs (the same as above). Since having equal weights results in backpropagation not learning at all, we randomly initialized the weights to lie between −0.5 and +0.5. Figure 7a shows the results of training backpropagation with 20 hidden-layer neurons, and Figure 7b shows the result of training with 9 hidden-layer neurons. The performance of lateral backpropagation was significantly better compared to each of the cases in Figure 7. Indeed, backpropagation was unable to approximate the third mode of the function at the end of 12,000 epochs. We have analyzed this behavior in some detail (Kothari & Agyepong, 1996; see also Fahlman & Lebiere, 1991), but, briefly, backpropagation has difficulty in approximating such a function because incorrect approximation of the third hump has a lower error (penalty) than incorrect approximation of the other two humps. Since the
1396
Kwabena Agyepong and Ravi Kothari
Figure 5: Approximation at (a) 0, (b) 6000, (c) 9000, and (d) 12,000 epochs (shown by broken lines). The function to be approximated is shown by solid lines.
hidden-layer neurons in a strictly feedforward network evolve independent of each other, all or most them independently evolve to reduce the larger sources of error. Only when the errors are comparable are the smaller sources of error targeted. The second hump in this case yields a smaller total error than the first hump, and the third hump yields the lowest total error. Consequently, more attention is initially paid to these humps. Such a situation also arises in the case of imbalanced classification problems (one class has many more training samples than another class). 4.3 Simulation 3. In contrast to the first two simulations, this simulation is intended to demonstrate better generalization in lateral backpropagation due to the automatic reduction in the effective number of free parameters. Toward that end, we consider a simple problem: approximating a function that is linear (see Figure 8). A network of size 1 × 12 × 1 was used (1 input, a hidden layer with 12 sigmoidal neurons, and 1 linear output neuron). Figure 8 shows that the training data points that the proposed algorithm (lateral backpropagation) was trained with. For comparison, we also ran backpropagation on a network with the same number of hidden neurons.
Controlling Hidden Layer Capacity
1397
Figure 6: (a) Values of the bias, (b) weights from the input, (c) weights from the hidden layer neurons to the output, and (d) lateral weights shown after 0, 6000, 9000, and 12,000 epochs, respectively, for the approximation problem of Figure 5. Note the differentiation by the outermost hidden-layer neurons while the hidden neurons in the center remain undifferentiated.
Figure 7: Approximation with a purely feedforward network with 20 and 9 hidden-layer neurons shown in panels (a) and (b), respectively. Compare with the performance of lateral backpropagation in Figure 5.
1398
Kwabena Agyepong and Ravi Kothari
Figure 8: Performance of lateral backpropagation (BP) and backpropagation on a network of 1 input, 12 sigmoidal hidden neurons, and 1 linear output neuron. The learning rate used was 0.1, and both networks were run to the same error (0.1). Note the considerably smoother approximation with the lateral connections, despite an abundance of free parameters.
A learning rate of 0.1 was used for both networks, and both were trained to the same sum of squared error (0.1). One may observe the considerably smoother approximation afforded by the presence of the lateral connections, despite the presence of far more parameters than are required. We also observed considerable variability in the generalization performance obtained with backpropagation with changing training parameters. The simulation nonetheless highlights the poor generalization performance that results with backpropagation, especially when there are an excessive number of free parameters. In contrast, the generalization performance of lateral backpropagation is much more robust. 4.4 Simulation 4. In this final simulation, we further illustrate the better generalization resulting from the use of the proposed lateral connections, using real-world data. We consider the yearly sunspot data from 1770–1920, as shown in Figure 9. We use the data from 1770 to 1869 for training and the data from 1870 to 1920 for ascertaining the generalization performance. Denoting the yearly sunspot data by x(t), the input to the networks consisted of x(t−1), and x(t−2), and the desired output was x(t). We considered networks
Controlling Hidden Layer Capacity
1399
Figure 9: Yearly sunspot data, 1770 to 1920. Data from 1770 to 1869 are used for training purposes, while data from 1870 to 1920 are used to ascertain the generalization.
with varying number of hidden-layer neurons and in each case compared the result to standard feedforward networks trained using backpropagation. For each comparison, we trained the network with backpropagation and with lateral backpropagation until they had the same sum-of-squared error on the training data (0.9). A learning rate of η = 0.1, ηq = 0.01 was used in all the sunspot simulations. Figure 10 summarizes the sum-of-squared error, on the generalization data, obtained using backpropagation and lateral backpropagation with different numbers of hidden-layer neurons. As expected, because of functionally identical neurons (only three neurons were used with lateral backpropagation) in the center of the hidden layer, the generalization performance of the proposed algorithm (lateral backpropagation) does not change substantially. In contrast, standard feedforward networks are extremely sensitive to the choice of the number of hidden-layer neurons. 5 Discussion and Conclusion We have shown that the use of lateral interconnections among the hiddenlayer neurons facilitates controlled assignment of role and specialization of the hidden-layer neurons. In particular, we showed that as training progresses, hidden neurons become progressively specialized—starting from the fringes and leaving the neurons in the center of the hidden layer unspe-
1400
Kwabena Agyepong and Ravi Kothari
Figure 10: Generalization performance of lateral backpropagation (BP) and backpropagation on the sunspot data, with varying number of hidden-layer neurons. Note the increasingly better performance of backpropagation with a reduced number of hidden-layer neurons. The proposed method (lateral backpropagation) is relatively insensitive to the choice of the number of hidden-layer neurons. Each network was trained to the same error on the training data.
cialized. Four simulations—two to demonstrate the selective specialization of hidden-layer neurons and two to demonstrate the better generalization resulting from functionally identical neurons in the center of the hidden layer—were presented. Comparative performance with backpropagation for standard feedforward networks was also presented. While criteria such as early stopping (stopping when the error on a validation data set starts increasing) can be used with standard backpropagation to increase the generalization performance, the proposed paradigm does not require independent validation data, which may be expensive to obtain or may not be available. Alternatively, in lateral backpropagation one can also stop the training when the herding effect in the hidden layer starts to diminish. The result of having identically behaving neurons in the central part of the hidden layer makes available a simple learning paradigm in which the effective number of free parameters is automatically reduced. In that sense, the proposed method is in the same spirit as the soft weight sharing scheme of Nowlan and Hinton (1992). However, the proposed method may also be viewed as a network growing algorithm without the explicit need to add neurons. In addition, the localization of this weight sharing at the hidden layer may allow procedural determination of the number of hidden
Controlling Hidden Layer Capacity
1401
layer neurons required. It may also be possible to replace the weight-shared portion of the network with a single neuron whose incoming and outgoing weights are set such that the network is identical in performance. Such replacements as training progresses may also provide for faster training with large networks. Acknowledgments This research was supported in part by a UC/OBR Research Challenge Grant. References Baum, E. B., and Haussler, D. (1989). What sized net gives valid generalization? Neural Computation 1(1):151–160. Bishop, C. M. (1995). Neural networks for pattern recognition. NY: Oxford University Press. Bose, N. K., & Garga, A. K. (1993). Neural network design using Voronoi diagrams. IEEE Transactions on Neural Networks 4(5):778–787. Chauvin, Y. S. (1989). A back-propagation algorithm with optimal use of hidden units. In D. S. Touretzky, (Ed.), Advances in Neural Information Processing Systems 2 (pp. 642–649). San Mateo, CA: Morgan Kaufmann. Cios, K. J., & Liu, N. (1992). A machine learning method for generation of a neural network architecture: A continuous ID3 algorithm. IEEE Transactions Neural Networks 3(2):280–291. Fahlman, S. E., & Lebiere, C. (1991). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann. Gallant, S. I. (1986). Three constructive algorithms for network learning. In Proc. of the Eighth Annual Conference of the Cognitive Science Society (pp. 652–660). Amherst, MA. Hassibi, B., & Stork, D. G. (1993). Second order derivatives for networks pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in Neural Information Processing Systems 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann. Karnin, E. D. (1990). A simple procedure for pruning back propagation trained neural networks. IEEE Transactions on Neural Networks 1(2):239–242. Kothari, R., & Agyepong, K. (1996). On lateral connections in feed-forward neural networks. In Proc. IEEE International Conference on Neural Networks (pp. 13– 18). Kwok, T., & Yeung D. (1995). Constructive feedforward neural networks for regression problems: A survey (Tech. Rep. HKUST-CS95-43). Hong Kong University of Science and Technology. LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
1402
Kwabena Agyepong and Ravi Kothari
Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 1 (pp. 107–115). San Mateo, CA: Morgan Kaufmann. Nadal, J. P. (1989). Study of a growth algorithm for neural networks. International Journal of Neural Systems 1:55–59. Nowlan, S. J., & G. E. Hinton. (1992). Simplifying neural networks by soft weight sharing. Neural Computation 4(4):473–493. Plaut, D. C., Nowlan, S. J., & Hinton, G. E. (1986). Experiments on learning by backpropagation (Tech. Rep. CMU-CS-86-126). Pittsburgh, PA: Carnegie Mellon Univerisity. Reed, R. (1993). Pruning algorithms—A survey. IEEE Transactions on Neural Networks 4(5):740–747. Rumelhart, D. E., McClelland, J. L., & PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition. Cambridge, MA: MIT Press. Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight elimination with application to forecasting. In R. P. Lippman, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems 3 (pp. 875–882). San Mateo, CA: Morgan Kaufmann.
Received June 27, 1996; accepted October 22, 1996.
ARTICLE
Communicated by John Tsitsiklis
Mean-Field Theory for Batched TD(λ) Fernando J. Pineda Applied Physics Laboratory, Johns Hopkins University, Laurel, MD 20723, U.S.A.
A representation-independent mean-field dynamics is presented for batched TD(λ). The task is learning to predict the outcome of an indirectly observed absorbing Markov process. In the case of linear representations, the discrete-time deterministic iteration is an affine map whose fixed point can be expressed in closed form without the assumption of linearly independent observation vectors. Batched linear TD(λ) is proved to converge with probability 1 for all λ. Theory and simulation agree on a random walk example. 1 Introduction The general task of learning from delayed reinforcement is ubiquitous in nature, and it is one of the most difficult tasks in machine learning. One form of delayed reinforcement learning is the TD(λ) algorithm proposed by Sutton (1988). The algorithm attempts to learn a function f (.) that approximates the expected outcome of an absorbing Markov process. The algorithm observes the Markov process as a sequence of observation vectors, {xt }t=1,...,T , terminated by an outcome z. Loosely speaking, the algorithm tries to find a set of parameters, w, such that f (w, x) ≈ E{z|x}. TD(λ) solves the delayed reinforcement learning problem by enforcing a kind of continuity assumption: that temporally adjacent observations should result in similar predictions. A unique feature of the algorithm is a parameter λ for controlling the tradeoff between bias and variance. When λ = 1, variance is greatest, and the algorithm produces the same weight updates as the batched Widrow-Hoff algorithm. When λ = 0, bias is greatest, and in this case Sutton showed that if one assumes that the set of observation vectors is linearly independent and the learning rate is chosen small enough, then the predictions of TD(0) converge in expected value to the fully observed predictions: lim E{ f (w(p) , xi )} → E{z|xi } = E{z|i}.
p→∞
(1.1)
Dayan (1992) extended Sutton’s convergence result to general λ. He demonstrated that with no further assumptions, TD(λ) for 0 ≤ λ < 1 also converged in expected value. More recently, several investigators noted that stochastic approximation methods could be used to make stronger statements about convergence. In particular, Dayan and Sejnowski (1994) noted Neural Computation 9, 1403–1419 (1997)
c 1997 Massachusetts Institute of Technology °
1404
Fernando J. Pineda
that the conditions required to prove convergence in expected value, together with bounds on the variance of the predictions, were exactly what was required by theorems due to Kushner and Clark (1978) to prove convergence with probability one (w.p.1). Jaakkola, Jordan, and Singh (1994) proved a quite general convergence theorem and then used it to prove convergence w.p.1 of batched and on-line TD(λ). Tsitsiklis (1994) proved a similar theorem and noted that it could be applied to TD(λ). The assumption of linear independence of the observation vectors is equivalent to using a table look-up representation. Table look-up representations require memory that grows linearly with the number states in the Markov system. Compact or function-approximation representations, on the other hand, require only enough memory to store the adjustable parameters of a suitable approximation function. To use a compact representation, a learning algorithm needs to capture the underlying regularity of the task. The hope is that the learning algorithm need only observe a small fraction of the accessible states to generalize to unobserved states. In principle, the ability of compact representations to generalize enables learning machines to tackle otherwise intractable tasks. A powerful demonstration of generalization within the TD(λ) framework was provided by Tesauro (1992), who used TD(λ) with λ = 0 to train an agent on the task of learning to play backgammon. The agent handily defeated other automated backgammon players and recently achieve master-level play (Tesauro, 1995). Some theoreticians have tackled compact representations in the setting of Q learning (Singh, Jaakkola, & Jordan, 1995; Gordon, 1996), but until recently, little was known formally about TD(λ) in this setting. The principal technical contribution of this report is the elaboration of the mean-field dynamics of TD(λ) in the case of compact representations. I derive general representation-independent results and then apply these results to linear TD(λ). I derive a closed-form solution for the linear TD(λ) fixed point and prove convergence w.p.1 to this fixed point. The mean-field theory described in this article (except for the proof in section 4) was initially presented at Neural Networks for Physicists 5 (Pineda, 1995). Since that time, I have become aware of the independent work of Tsitsiklis and Van Roy (1996a, 1996b), who have developed a powerful framework for applying dynamic programming to compact representations. They recently applied their methodology to TD(λ). Some of their results are similar to those reported here except that they are derived in the context of infinite horizon Markov chains, while the results presented here are derived in the context of finite horizon Markov chains. It is also my understanding that Bertsekas and Tsitsiklas have a forthcoming treatment of the finite horizon case (Bertsekas and Tsitsiklas, 1996). In some ways, the framework presented here is more limited than the general on-line approach of Tsitsiklis and Van Roy. Nevertheless, the approach presented here is self-contained and should be useful for further investigation of TD(λ) in the absorbing Markov setting.
Mean Field Theory for Batched TD(λ)
1405
2 Preliminaries 2.1 Definitions. It is useful, at the outset, to define the notation that I shall use throughout. To this end, let the state space of a finite Markov process be the direct sum A ⊕ T of two state spaces. These are A, the set of absorbing states that has cardinality nA , and T, the set of transient states that has cardinality nT . A probability vector p is a vector with positive entries and unit norm. Assume that initially the states of the system are distributed with prior probabilities p(0) . At the conclusion of t state transitions, the states of the system are distributed with probabilities p(t) . The transition matrix P for the Markov chain transforms probability vectors at time t into probability vectors at time t + 1. The matrix element pij of P is the probability that the system transitions from state i to state j in one time step. P can be partitioned into four submatrices according to how each submatrix transforms states: µ ¶ I R P= . (2.1) 0 Q I is the identity matrix. Qij is the probability of making a transition from transient state j to transient state i. Rij is the probability of making a transition from transient state j to absorbing state i. The ith absorbing state is associated with a scalar “outcome” zi . Together, the outcomes define an nA -dimensional vector z. The observation is represented by a function Ä(.) that maps states to nw dimensional observation vectors, that is, xi = Ä(i). The observation matrix X is the (nw ×nT )-dimensional matrix whose columns consist of observation vectors. The observation function, Ä, need not be invertible. This allows for the possibility that the Markov process is partially observed. Moreover, if the function Ä is stochastic, then we are operating within a hidden Markov setting. This is an obvious generalization of the approach described in this article. I do not treat this case explicitly so as to keep the presentation simple. The Markov process induces an ensemble of random sequences. A randomly drawn sample from this ensemble is a sequence s that consists of T observations and one outcome. The length of the sequence T + 1 is a stochastic variable. The tth transient state in the sequence is denoted by s(t). It corresponds to an observation xs(t) . The last state in the sequence is s(T+1) and corresponds to an outcome zs(T+1) . All expectations that are taken in this article assume that states are sampled with frequencies determined by {Q, R, p}. Under this assumption, it is a standard exercise to show that the expected outcome of the process, given that it is known to be in state i, is E{z|i} = [zT R(1 − Q)−1 ]i .
(2.2)
In the following discussion, I will use (T ) to denote the transpose operator in the space of observations and († ) to denote the transpose operator in the space of weights.
1406
Fernando J. Pineda
2.2 Mean Field Setting. Batched-TD(λ) has the form w(n+1) = w(n) + α (n+1) F(s(n+1) , w(n) ),
(2.3)
where (n+1)
F(s
(n)
,w
¯ T t ¯ X X ¯ t−τ )= ( fs(t+1) − fs(t) ) λ ∇ fs(τ ) ¯ ¯ t=1 τ =1
,
(2.4)
w=w(n)
and where, for notational simplicity, I have defined fs(t) = f (w, xs(t) ). f (., .) is a deterministic function of the nth input sequence s(n) and weight vector w(n) . The sequences are drawn at random from a distribution determined by the absorbing Markov process. Thus, at every point w in the weight space, there exists an ensemble of weight updates. Each weight update corresponds to a particular sequence. The nth expected weight update, averaged over the ensemble of sequences, is written as h1w(n) i = α (n+1) hF(s, w(n) )i ≡ α (n+1) F(w(n) ). The manifold of averaged weight update vectors defines a mean-field F on the parameter space. This mean field is completely determined by the statistical properties of the Markov process, by the approximation function f (.) and by the observables X. F induces a continuous-time mean-field dynamical system according to dw = F(w). dt
(2.5)
A discrete-time system can be derived from the continuous-time system by taking dw/dt → 1w/1t and defining 1t(n) = α (n) . The result is w(n+1) = w(n) + α (n+1) F(w(n) ).
(2.6)
Either the continuous or the discrete-time deterministic dynamics can be used to study batched-TD(λ). 2.3 Expected Values of Partial Sums. To calculate F(w) for batchedTD(λ), it is useful to characterize the mean and variance of partial sums of the form σ =
T t X X (g(xs(t+1) ) − g(xs(t) )) λt−τ h(xs(τ ) ), t=1
(2.7)
τ =1
where g(.) and h(.) are arbitrary functions of observation vectors. The first result below is an expression for the mean of the partial sum; the second result is an upper bound for the variance of the partial sum.
Mean Field Theory for Batched TD(λ)
1407
Result 1. Consider an absorbing Markov process determined by {Q, R, p, z, X}. Let g(.) and h(.) be arbitrary functions of the observation vectors. Moreover, let λ be a constant that satisfies 0 ≤ λ ≤ 1. Then, the expected value of the partial sum (see equation 2.7) is E{σ } = (−gT K(λ) + zT J(λ))h, where h is an nT -dimensional column vector whose ith component is hi = h(xi ), where gT is an nT -dimensional row vector whose ith component is gi = g(xi ), and where K and J are defined according to K(λ) ≡ (I−Q)(I−λQ)−1 D and J(λ) ≡ R(I − λQ)−1 D. The matrix D is a diagonal matrix whose ith elements are simply the average number of times that the ith state is visited in a sequence. Proof. See the appendix. Result 2. The variance E{σ 2 } is bounded by E{σ 2 } ≤ |h|2 (a + b|g| + c|g|2 ) where a, b, and c are positive constants. P P Proof. The second moment of σ = Tt=1 (g(xs(t+1) )−g(xs(t) )) tτ =1 λt−τ h(xs(τ ) ) has the structure X Gµν hσ (µ) · σ (ν) i, hσ 2 i = µ<ν≤3
where G is a function of λ and σ (µ) are partial sums defined in the appendix. Each of the six terms in hσ 2 i will have the general structure X X X µν µν µν hσ (µ) · σ (ν) i = hj hl Ai + gi hj hk Bijk + gi hj gk hl Cijkl ij
ijk
ijkl
where A, B, and C are proportional to a geometric power series in Q and λ. Each series will converge because limn→∞ Qn = 0 and because 0 ≤ λ ≤ 1. Application of the triangle and Schwartz inequalities immediately results in the required upper bound, where a, b, and c are positive constants that can be calculated from G, A, B, and C. The fact that a, b, and c are finite is a direct consequence of the facts that A, B, and C are defined by convergent power series in Q and λ and that the state space of the Markov process is finite. 3 Mean-Field Theory 3.1 General Representations. Starting from an arbitrary compact representation f (w, x), the mean field is computed from equation 2.4 by applying
1408
Fernando J. Pineda
result 1 with [g]i ≡ fi and [hj ]i ≡ ∂ fi /∂wj to each component of F(s(n+1) , w(n) ). The resulting continuous-time mean-field dynamical equation is dw = (−fT K(λ) + zT J(λ))∇ † f, dt
(3.1)
where w is an nw -dimensional column vector and ∇ = (∂/∂w1 , . . . , ∂/∂wnw ) is an nw -dimensional row vector operator. The transpose of ∇ in the weight space is a row vector operator denoted by ∇ † . This representation-independent dynamical equation can also be expressed in terms of E{z|i}. To this end, define the vector e whose ith component is [e]i = E{z | i}.
(3.2)
In terms of e, equation 3.1 takes the somewhat simpler form (see, e.g. Dayan, 1992), dw = (e − f)T K(λ)∇ † f. dt
(3.3)
Although the two forms of the dynamical equation are equivalent, we find it more convenient to use equation 3.1. From equation 3.1 we observe that the fixed points of the mean-field equations satisfy ¯ ¯ (−fT K(λ) + zT J(λ))∇ † f¯
w=w∗
= 0.
(3.4)
The local properties of the flow are characterized by the Jacobian matrix Hij ≡
∂Fi = [∇ † F† ]ij , ∂wj
(3.5)
whose explicit form is H = −(∇ † fT )K(λ)∇f + zT J(λ)∇ † ∇f,
(3.6)
where [∇ † ∇]ij ≡ ∂ 2 /∂wi ∂wj . If the Jacobian matrix, evaluated at the fixed point, is negative definite, then all the eigenvalues of H are negative, and the fixed point is locally stable. Moreover, if the Jacobian is everywhere negative definite, the flow is globally contracting, and there is a unique stable fixed point. For arbitrary representations, little can be said analytically concerning the properties of H. On the other hand, for linear representations, it is possible to say much more. In the next section the case of linear representations is investigated in more detail.
Mean Field Theory for Batched TD(λ)
1409
3.2 Linear Representations. In the case of linear representations, we have f = XT w, ∇f = XT , and ∇ † ∇f = 0. We assume nothing about the observation function Ä. Accordingly, the Jacobian simplifies to H(lin) ≡ −(∇ † fT )K(λ)∇f.
(3.7)
This leads immediately to result 3. Result 3. Let f = XT w. Then a sufficient condition for the flow of the TD(λ) mean-field equations to be globally contracting is Rank(X) = nw . Moreover, since the flow is globally contracting, there exists a unique stable fixed point. Proof. To prove the flow is globally contracting, it suffices to prove that H(lin) is negative definite. To prove that H(lin) is negative definite, recall that if an arbitrary matrix A in Rn×n is positive definite and an arbitrary matrix 3 in Rn×m has rank m, then 3T A3 in Rm×m is also positive definite (see, e.g., theorem 4.2.1 Golub & Van Loan, 1996). Thus, if the (nw × nT )-dimensional matrix ∇f has rank nw , then H(lin) = −(∇ † fT )K(λ)∇f is negative definite because K is positive definite (Dayan, 1992; Dayan & Sejnowski, 1994). Thus the flow is globally contracting. The fact that the flow is globally contracting allows us to apply the contraction mapping theorem to prove the existence and uniqueness of a fixed point. The proof of this result relies on the claim that the matrix K is positive definite. This condition is violated by learning algorithms that sample the state space with frequencies that are different from that of the underlying Markov process (see, e.g., Tsitsiklis & Van Roy, 1996b). This is not a problem in the present setting since the visitation frequencies are determined by the transition and prior probabilities {Q, R, p} of the to-be-predicted Markov process. The assumption that ∇f has rank nw is not very stringent since it is violated only if the row rank of X is less than nw . Such a circumstance is easy to concoct by choosing a malicious representation. (For example, Rank(X) = 1 if all the rows of X are identical.) On the other hand, a randomly selected linear representation will, with probability 1, have Rank(X) = nw . It is useful for numerical studies and for the convergence proof in the next section to write down the linear dynamical equations explicitly: dw = −A(λ)w + b(λ), dt
(3.8)
where A(λ) = XKT (λ)XT
(3.9)
1410
Fernando J. Pineda
and where b(λ) = XJT (λ)z.
(3.10)
The unique stable fixed point is w∗ = A−1 (λ)b(λ),
(3.11)
where, as we have already determined, the invertibility of A(λ) is a trivial consequence of the fact that A(λ) is positive definite. From expression 3.11 and the explicit forms of K and J, we can write down a closed-form expression for the prediction vector to which mean-field TD(λ) converges: f = XT w∗ = XT (XKT (λ)XT )−1 XJT (λ)z.
(3.12)
This expression is valid provided Rank(X) = nw . Finally, we consider what can be said if we assume the Q matrix is symmetric. In this case it is possible to write down a Lyapunov function for equation 3.8 since if Q is symmetric, then K is also symmetric and its eigen√ √ T values √ are real and positive. It follows that we can write KT = K K where K is a symmetric, √ positive definite and (nT × nT )-dimensional matrix. In terms of BT ≡ X K, the fixed-point equation becomes −(BT B)w + BT y = h1wi
(3.13)
where √ y ≡ ( K)−1 JT z
(3.14)
F(w) = −∂E/∂wT = −BT (Aw − y)
(3.15)
or
where the Lyapunov function, E, is given by E = (Bw − y)T (Bw − y).
(3.16)
4 Convergence with Probability 1 Stochastic convergence of batched linear-TD(λ), is proved by making use of a standard theorem (e.g. Benveniste, M´etivier, & Priouret, 1990). The theorem follows, stated in terms of the notation used in this article. Theorem 1. Consider the random process w(n) = w(n−1) + α (n) F(n) (w(n−1) , s(n) ). Let the history of this process be denoted by 0 (n) ≡ {{w(k) }k≤n , {α (k) }k≤n ,
Mean Field Theory for Batched TD(λ)
1411
{F(k) }k≤n }. The random process converges to zero w.p.1 provided the following assumptions hold: A.1 The state space of the process is finite. P P (n) 2 = ∞, n α (n) < ∞ . A.2 nα A.3 (w − w∗ )T F(w) < 0 for w 6= w∗ and F(w∗ ) = 0. A.4 Var{F(n) | s(n) , 0 (n−1) } ≤ β(1 + kw(n−1) − w∗ k)2 where β is some positive constant. A.1 and A.2 are satisfied by construction. A.3 follows from the fact that F(w) = −Aw + b and that A is positive definite. Finally, to show that A.4 is satisfied, we recall result 2: Var{F(n) | s(n) , 0 (n−1) } ≤ |∇ f (n−1) |2 (a + b| f (n−1) | + c| f (n−1) |2 ), which, upon specializing to the linear case, yields the upper bound Var{F(n) | s(n) , 0 (n−1) } ≤ (A + Bkw(n−1) k + Ckw(n−1) k2 ) for some positive constants A, B, and C. This immediately implies that A.4 is satisfied. To summarize, we have satisfied all the assumptions of theorem 1 and have thereby proved that batched linear TD(λ) converges w.p.1. It is also amusing to observe that because a randomly selected linear representation will, w.p.1, have Rank(X) = nw it follows that a randomly chosen linear representation converges w.p.1. 5 Numerical Solutions In this section, I numerically evaluate fixed points for the task of learning to predict the outcome of a random walk, and I verify that these fixed points agree with the results of actual TD(λ) learning trials. The system consists of a one-dimensional random walk with five transient states and two absorbing barriers. Associated with each transient state is a probability, p, of stepping to the right and a probability (1 − p) of stepping to the left. The states i = 0 and i = 6 are absorbing and are associated with outcomes z = 0 and z = 1 respectively. The heavy line in Figure 1 shows the expected outcome E{z | i} for each of the five transient states under the assumption that p = 0.35. The two dotted lines show the expected errors, E{z | xi }, obtained by linearTD(λ) under the assumption that the observation vectors are scalars defined according to xi = i. The left-most states are visited more often than the right-most states, so the algorithm weights these more heavily. This effect is greatest for the λ = 1 (Widrow-Hoff) case and not as strong for the λ = 0 case. Clearly, the linear representation does not represent the exact solution very well. The effect of different values of λ on the asymptotic least mean square (LMS) error is shown in Figure 2. The solid line is the analytic solution. The
1412
Fernando J. Pineda
Figure 1: The exact solution of the random walk for p = 0.35 is shown along with the analytically calculated mean field solutions.
plotted points are each averaged over ten learning trials. In each trial, the learning rate α (n) was annealed over 105 iterations from an initial value of 5 α (1) = 0.001 to a final value of α (10 ) = 0.000005. The analytically calculated fixed points and the solutions found by batched-TD(λ) were used to calculate the expected LMS error. The results demonstrate excellent agreement between the analytically calculated asymptotic error and the results of the learning trials. Software, written in C, for reproducing these results can be downloaded from http://olympus.ece.jhu.edu/∼fernando/. 6 Discussion and Conclusions The mean-field theory presented in this article is potentially a powerful approach for gaining insight into the generalization properties of linear and nonlinear TD(λ). Although little can be said analytically for nonlinear representations, it is clear that the nonlinear deterministic mean-field equations can be solved numerically for particular representations and for particular learning tasks. The transient properties of TD(λ) can be investigated by studying the trajectories of the deterministic iterates. Moreover, the methods used to calculate the mean field can be applied to the evaluation of the fluctuations about the mean. The calculation of fluctuation terms can be expected to lead to a more detailed understanding of the generalization properties of TD(λ). Indeed, studies of this nature have been performed recently by Singh and Dayan (1996). These authors computed analytic learning curves for the linear table look-up case. Their results show how judicious choice of
Mean Field Theory for Batched TD(λ)
1413
Figure 2: Expected error versus λ for the analytic solution and the empirically obtained solution.
λ can lead to transient solutions whose LMS error is systematically lower than the Widrow-Hoff (λ = 1) solution. Appendix This appendix shows how to average a particular form of partial sum over an ensemble of sequences generated by an absorbing Markov process. The ensemble is denoted by {s}. A randomly drawn sample from this ensemble is a sequence s that is T +1 states long. The length of the sequence is random. The tth state in the sequence is denoted by s(t). Thus s(1) is the first state, and s(T + 1) is the absorbing state. The partial sums I consider have the general form σ =
T t X X (g(xs(t+1) ) − g(xs(t) )) λt−τ h(xs(τ ) ), t=1
(A.1)
τ =1
where 0 ≤ λ ≤ 1 and g(x) and h(x) are arbitrary functions of observation vectors. The observation vector xs(t) is finite dimensional and represents an observation made when the Markov system is in state s(t). For notational simplicity, it is useful to use fα in place of f (xα ) where α is any subscript. Much of the development that follows makes extensive use of the Kronecker-δ symbol. Two Kronecker-δ symbols are defined. One applies to transient states, and one applies to absorbing states. In particular, ½ 1 if j = s(t) and t ≤ T t . (A.2) δj = 0 otherwise
1414
Fernando J. Pineda
This Kronecker-δ is simply a representation of the trajectory of a Markov chain through state space. It is equal to 1 if the system is in transient state s(t) at time t and zero otherwise. It has the following useful properties: T X
δjt = nj = the number of times state j occurs in s.
(A.3)
t=1 T X
δjt =
∞ X
t=1
X
j
δjt (since δt is zero if j 6∈ T)
(A.4)
t=1
δjt = 1
(A.5)
j∈T
(since the Markov process is in a unique state at each instant) X f (xs(t) )δjt = f (xj ).
(A.6)
j∈T j
The second Kronecker-δ symbol is denoted δ t and is defined by ½ 1 if absorbs in state s(j) at time t t δj = 0 otherwise.
(A.7)
It is zero except when the system is in the absorbing state at time T + 1. It has properties similar to the first Kronecker-δ. The Q matrix is the propagator for multistep transition probabilities. In particular Qjk is the probability of making a one-step transition from state k to state j, while [Qn ]jk is the probability of making a transition from state k to state j in exactly n steps. It is a standard textbook exercise to show that [(1 − Q)−1 ]jk is the probability of making a transition from state k to state j in any number of transitions. Armed with these preliminary notions, we begin the calculation by reorganizing the partial sum (see equation A.1) into the sum of three terms as follows: σ = σ (1) + (1 − λ)σ (2) + σ (3)
(A.8)
where σ (1) = −
T X
gs(t) hs(t)
(A.9)
t=1
σ (2) =
T−1 X τ =1
λτ −1
T X t=τ +1
gs(t) hs(t−τ )
(A.10)
Mean Field Theory for Batched TD(λ)
σ (3) = gs(T+1)
T X
1415
λT−τ hs(τ ) .
(A.11)
τ =1
We now proceed to calculate individually the average value of each of these partial sums. Calculation of hσ (1) i. One starts by rewriting equation A.9 in terms of the Kronecker-δ. The result is σ (1) = −
∞ X X
gj hj δjt = −
t=1 j∈T
X
gj hj
∞ X
δjt .
t=1
j
P j The expression ∞ t=1 δt is the number of times that the jth transient state appears in the sequence. Thus, to average this expression over the ensemble of finite Markov chains, we need to calculate the expected number of times that a transient state appears in sequence. This textbook calculation proceeds as follows: ∞ ∞ XX X hδjt i = {1 · [Qt−1 ]jk + 0 · (1 − [Qt−1 ]jk )}pk t=1
k∈T t=1
=
∞ XX
t−1
[Q
]jk pk =
k∈T t=1
X· k∈T
1 1−Q
¸ pk
(A.12)
jk
≡ dj . Accordingly, we obtain hσ (1) i = −
X
gj hj dj
(A.13)
j
Calculation of hσ (2) i. One first rearranges the order of the terms in equation A.10 so as to obtain σ (2) =
t−1 T X X
λτ −1 gs(t) hs(t−τ ) .
(A.14)
t=2 τ =1
This can be expressed in terms of the Kronecker-δ as σ (2) =
t−1 T X XX ij∈T t=2 τ =1
λτ −1 gi hj δit δjt−τ .
(A.15)
1416
Fernando J. Pineda
Now observe that the upper limit to the sum over t can be continued to ∞ because the Kronecker-δ is zero for t > T. Thus hσ (2) i is hσ (2) i =
t−1 ∞ X XX ij∈T t=2 τ =1
λτ −1 g(xi )h(xj )hδit δjt−τ i.
(A.16)
The expected value of the product of Kronecker-δ’s is X [Qτ ]ij [Qt−τ −1 ]jk pk . hδit δj1−τ i =
(A.17)
k∈T
If we substitute equation A.17 back into equation A.16, we obtain " # t−1 ∞ X X 1 X (2) τ t−τ −1 [(λQ) ]ij [Q ]jk . gi hj pk hσ i = λ ijk∈T t=2 τ =1
(A.18)
Notice that the summations over time are simply a geometric series of the form j−1 ∞ X X
ak b j−k−1 =
j=2 k=1
a . (1 − a)(1 − b)
(A.19)
Upon taking equation A.19 into account, we find that equation A.18 becomes " # ¸ X· ¸ · X 1 1 (2) Q pk gi hj hσ i = 1 − λQ ij k∈T 1 − Q jk ij∈T " ¸ # · X 1 Q dj . gi hj (A.20) = 1 − λQ ij ij∈T Calculation of hσ (3) i. Note that equation A.11 can be expressed in terms of Kronecker-δ’s as σ (3) = gs(T+1)
T X
λT−τ hs(τ ) =
X
τ =1
gi hj
i∈A i∈T
T X
T+1 t δj .
λT−τ δ i
(A.21)
t=1
Now, if we reorder the temporal summation in equation A.21 and continue the upper limit of the first sum to ∞, we find that hσ (3) i has the form hσ (3) i =
T ∞ X XX i∈A i∈T
T=1 t=1
T+1 t δj i.
gi hj λT−t hδ i
(A.22)
Mean Field Theory for Batched TD(λ)
1417
By applying the same arguments as used to calculate equation A.17, we find that T T X ∞ X ∞ X X X T+1 T+1 hλT−t δ i δjt i = hλT−t δ i δjt ipk T=1 t=1
T=1 t=1 k∈T T ∞ X X X = [R(λQ)T−1 ]ij [Qt−1 ]jk pk . T=1 t=1
(A.23)
k∈T
As before, the double temporal sums are a geometric series. This time the form is j ∞ X X
a j−k bk−1 = (1 − a)−1 (a − b)−1 .
(A.24)
j=1 k=1
Substituting equation A.24 into equation A.23, we obtain T ∞ X X
T+1 t δj }
E{λT−t δ i
· = R
T=1 t=1
·
1 1 − λQ
1 = R 1 − λQ
¸ X· ¸
ij k∈T
1 1−Q
¸ pk jk
dj . ij
So hσ (3) i becomes hσ (3) i =
XX
· gi hj R
i∈A j∈T
1 1 − λQ
¸ dj .
(A.25)
ij
The final result is obtained by substituting equations A.13, A.20, and A.25 into A.8. The result is hσ i = −
X
gi K(λ)ij hj +
XX
zi J(λ)ij hj ,
(A.26)
dj
(A.27)
i∈A j∈T
ij∈T
where ¸
·
1 Q K(λ)ij = I − (1 − λ) 1−λ and where
· J(λ)ij ≡ R
1 1 − λQ
¸ dj . ij
ij
1418
Fernando J. Pineda
Acknowledgments I thank Philippe de Forcrand and Sebastian Seung for insightful comments and suggestions at Neural Networks for Physicists 5. I thank Tommi Jaakkola for educational correspondence. Peter Dayan also educated me and brought to my attention the recent work of Tsitsiklis and Van Roy. Finally, I thank Ben Van Roy and an anonymous reviewer for pointing out an error in the stochastic convergence proof in section 4. References Benveniste, A., M´etivier, M., & Priouret, P. (1990). Adaptive Algorithms and Stochastic Approximations. Berlin: Springer-Verlag. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific. Dayan, P. (1992). The convergence of TD(λ) for general lambda. Machine Learning, 8, 341–362. Dayan, P., & Sejnowski, T. J. (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295–301. Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press. Gordon G. J. (1996). Stable fitted reinforcement learning. In G. Tesauro, D. Touretzky, & T. Lean (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 1996 Conference (pp. 1052–1058). Cambridge, MA: MIT Press. Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1185– 1201. Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer-Verlag. Pineda, F. J. (1995, August 23–26). Generalization in TD(λ). Theoretical Physics Institute, University of Minnesota. Singh, S. P., & Dayan, P. (1996). Analytical mean squared error curves in temporal difference learning. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in Neural Information Processing Systems, 9. Cambridge, MA: MIT Press. Singh, S. P., Jaakkola, T., & Jordan, M. (1995). Reinforcement learning with soft state aggregation. In G. Tesauro, D. Touretzky, & T. Lean (Eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference (pp. 361–368). Cambridge, MA: MIT Press. Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44. Tesauro, G. (1992). Practial issues in temporal difference learning. Machine Learning, 8, 257–277. Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38, 58–68. Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16, 185–202.
Mean Field Theory for Batched TD(λ)
1419
Tsitsiklis, J. N., & Van Roy, B. (1996a). Feature-based methods for large scale dynamic programming. Machine Learning, 22, 59–94. Tsitsiklis, J. N., & Van Roy, B. (1996b). An analysis of temporal-difference learning with function approximation. (Tech. Rep. No. LIDS-P-2322). Cambridge, MA: MIT Laboratory for Information and Decision Systems.
Received May 6, 1996; accepted November 7, 1996.
LETTERS
Communicated by Shun-ichi Amari
Redundancy Reduction and Independent Component Analysis: Conditions on Cumulants and Adaptive Approaches Jean-Pierre Nadal Laboratoire de Physique Statistique, Ecole Normale Sup´erieure, Paris, France
Nestor Parga Departamento de F´ısica Te´orica, Universidad Aut´onoma de Madrid, Canto Blanco, Madrid, Spain
In the context of both sensory coding and signal processing, building factorized codes has been shown to be an efficient strategy. In a wide variety of situations, the signal to be processed is a linear mixture of statistically independent sources. Building a factorized code is then equivalent to performing blind source separation. Thanks to the linear structure of the data, this can be done, in the language of signal processing, by finding an appropriate linear filter, or equivalently, in the language of neural modeling, by using a simple feedforward neural network. In this article, we discuss several aspects of the source separation problem. We give simple conditions on the network output that, if satisfied, guarantee that source separation has been obtained. Then we study adaptive approaches, in particular those based on redundancy reduction and maximization of mutual information. We show how the resulting updating rules are related to the BCM theory of synaptic plasticity. Eventually we briefly discuss extensions to the case of nonlinear mixtures. Throughout this article, we take care to put into perspective our work with other studies on source separation and redundancy reduction. In particular we review algebraic solutions, pointing out their simplicity but also their drawbacks. 1 Introduction Recently many studies have been devoted to the study of sensory coding following a general framework initiated by H. Barlow more than thirty years ago (Barlow, 1961). The general idea is to define a cost function based on the properties one thinks a neural code should satisfy. Then, given a neural architecture with a simple enough neuron model, one derives the parameters of the network (synaptic efficacies, transfer functions, etc.) that minimize this cost function. A great deal of work has been done on cost functions based on information-theoretic criteria (Barlow, Kaushal, & Mitchison, 1989; Linsker, 1988; Atick, 1992). The result for the receptive fields will crucially Neural Computation 9, 1421–1456 (1997)
c 1997 Massachusetts Institute of Technology °
1422
Jean-Pierre Nadal and Nestor Parga
depend on the statistical properties of the signal to be processed (visual scenes, olfactory stimuli, etc.). Several cost functions have been proposed. One is based on Barlow’s original idea (Barlow, 1961), which is that redundancy reduction should be performed. This leads to the notion of factorial code (Barlow et al., 1989; Redlich, 1993): in the pool of neurons defining the neural code (the output neurons), each neuron should code for features statistically independent of the features encoded by the other neurons. Redundancy reduction has been explored in detail in the case of visual processing, with a linear neural network model (Atick, 1992). A different proposal has been promoted by Linsker (1988) under the name of the infomax principle: one asks the network to maximize the mutual information between the input (the signal received onto the receptors) and the output (the neural code). Again, most studies have been performed for linear networks in the context of visual processing (Linsker, 1988; van Hateren, 1992; Del Giudice, Campa, Parga, & Nadal, 1995). The predictions of these two strategies, redundancy reduction and maximization of mutual information, appeared to be very similar. In fact, we have shown (Nadal & Parga, 1994) that in the low synaptic noise limit with nonlinear outputs, infomax implies redundancy reduction, when optimization is done over both the synaptic efficacies and the nonlinear transfer functions. As a result, the optimal neural representation is a factorial code, for both cost functions in the limit of low processing noise. When noise is present, it can easily be shown (Atick, 1992) that pure redundancy reduction is not a meaningful strategy. Essentially, an efficient code is one that gives the independent features but add some redundancy to compensate for the noise. This can be studied in detail for a linear neural network (Del Giudice et al., 1995), where one can find, for instance, the number of meaningful principal components at a given noise level. In the case of signal processing and data analysis, however, the processing noise can indeed be neglected. It is then interesting to search for algorithms able to perform redundancy reduction and to determine the neural architectures best adapted to a given type of signal. Of particular interest is the case where a factorial code can be found by a feedforward network with no hidden layer. This implies that there exists a linear combination of the input that produces a factorial code. This arises when the input (the signal to be processed) itself is a linear mixture of statistically independent sources. Then finding the synaptic efficacies leading to a factorial code is equivalent, in the signal processing language, to finding the linear filter performing blind source separation (BSS). Since the pioneering work of Bar-Ness (1982) and Jutten and Herault (1991), interest in source separation has increased considerably (Comon, 1994; Cardoso, 1989; Molgedey & Schuster, 1994). The link between signal processing and sensory coding has been present in this field from the beginning. Jutten and Herault proposed a now wellknown heuristic for performing BSS using a neuromimetic architecture for analyzing a signal coming from muscles. Experience gained with the study of this algorithm led to new approaches to the study of the olfactory system
Redundancy Reduction and Independent Component Analysis
1423
(Hopfield, 1991; Rospars & Fort, 1994). On the neuropsychological side, it is involved in the “cocktail party effect” (being able to focus on a single speaker, separating his or her speech from all the ambient noise). In signal processing, BSS can be used for making cancellers, that is, systems for subtracting the noise from the signal (Bar-Ness, 1982). In fact, BSS has a wide variety of applications in data processing (see, e.g., Deville & Andry, 1995). In speech and sound processing, there is an additional complication: the signal is not simply a linear superposition of sources but also a convolution (Jutten & Herault, 1991). In this article, however, we will not consider the case of blind source deconvolution. While working on linear mixtures of sources, researchers soon realized that finding independent components might be a useful strategy for data processing. The term independent component analysis (ICA) was then promoted as a generalization of BSS to nongaussian data and as an alternative to the standard principal component analysis (PCA) (Jutten & Herault, 1991; Comon, 1994). To one of his papers on BSS, Comon (1994) has given the title: “Independent Component Analysis: A New Concept?” The answer to this question is no, since this concept was proposed by Barlow at the beginning of the 1960s, under the name of factorial coding. In fact, even earlier, Attneave (1954) had addressed it, although in a less precise way, under the name of economy of coding. Still, many years after Barlow suggested that factorial coding could be a major strategy used by nature for sensory coding, engineers arrived at the same conclusion in the context of signal analysis. In the same vein, it has been shown (Comon, 1994) that there is a particularly well-chosen cost function for BSS, and this special cost function is nothing but the redundancy reduction cost function mentioned above in the context of sensory coding. This convergence in the definition of relevant concepts may not seem surprising, but this was not always the case. In particular, Barlow (1961) put forth some stimulating discussions opposing nature’s and engineers’ approaches to coding. In the case of the BSS framework, the hypothesis is that linear processing is able to produce a factorial code. The main goal in BSS is then to find efficient algorithms in order to compute the filter (or synaptic efficacies) that leads to source separation. Besides heuristic algorithms such as the HeraultJutten (HJ) algorithm (Jutten & Herault, 1991), there are adaptive and gradient descent algorithms based on appropriate cost functions (Bar-Ness, 1982; Burel, 1992; Comon, 1994; Laheld & Cardoso, 1994; Bell & Sejnowski, 1995; Delfosse & Loubaton, 1995; Amari et al., 1996), and algebraic solutions (F´ety, 1988; Cardoso, 1989; Tong, Soo, Liu, & Huang, 1990; Molgedey & Schuster, 1994; Shraiman, 1993) in which the filter is algebraically derived from a particular set of measurements on the data. In this article we present new results on BSS. We deal with different mathematical aspects of BSS, keeping in mind the possible application to sensory systems modeling and the general framework of factorial coding and redundancy reduction. Because there is a widespread literature on this
1424
Jean-Pierre Nadal and Nestor Parga
subject that belongs to both sensory coding and signal processing, we found it necessary to put our contribution into perspective with previous works on source separation and redundancy reduction. In fact we will use a general framework—the reduction to the search for an orthogonal matrix (explained in section 2)—for both deriving the new results and presenting related work, sometimes in a simpler way as compared to the original formulation. In section 3, we give necessary and sufficient conditions for the network to be a solution of the ICA. These conditions are that a given small set of cumulants have to be set to zero. The number of cumulants that enter in any of these conditions is smaller than what is usually found in the literature. In section 4, we present a short review of known algebraic solutions, making use of the same framework. In the next two sections we deal with adaptive approaches. In section 5, after a reminder of the relationship between maximizing mutual information and minimizing redundancy (hence, building a factorial code), we discuss several possible approaches based on these information-theoretic concepts. We consider one of them in more detail and show how the resulting updating rules are related to the BCM theory of synaptic plasticity (Bienenstock, Cooper, & Munro, 1982). In section 6, we briefly discuss adaptive algorithms from cost functions based on cumulants, these costs being built from the conditions derived in section 3. In section 7 we also briefly discuss possible extensions to nonlinear processing. Perspectives are given in section 8. 2 Linear Mixtures of Independent Sources 2.1 Statement of the Problem. Here and in the rest of the article as well, we consider the case where the factorization problem has a priori an exact solution using a linear filter or a feedforward neural network with no hidden units (we will comment, whenever appropriate, on the possible extensions to cases where a multilayer network would be needed). This means that we are assuming the input data to be a linear mixture of independent sources. More precisely, we assume that at each time t, the observed data S(t) are an N-dimensional vector given by Sj (t) =
N X
Mj,a σa (t), j = 1, . . . , N,
(2.1)
a=1
(in vector form S = Mσ ) where the σa are N-independent random variables, of unknown probability distributions, and M is an unknown, constant, N×N matrix, called the mixture matrix. By hypothesis, all the source cumulants are diagonal, in particular the two-point correlation at equal time K0 : 0 ≡ hσa (t)σb (t)ic = δa,b Ka0 , Ka,b
(2.2)
where δa,b is the Kronecker symbol. Here and throughout this article hz1 z2 , . . . , zk ic denotes the cumulant of the k random variables z1 , . . . , zk .
Redundancy Reduction and Independent Component Analysis
1425
Without loss of generality, one can always assume that the sources have zero average: hσa i = 0, a = 1, . . . , N
(2.3)
(otherwise one has to estimate the average of each input and subtract it from that input). Performing source separation (or, equivalently, factorizing the output distribution) means finding the network couplings or synaptic efficacies J (the linear filter J in the language of signal processing), such that the Ndimensional (filter, or network) output h, hi (t) =
N X
Ji,j Sj (t), i = 1, . . . , N,
(2.4)
j=1
gives a reconstruction of the sources. Ideally, one would like to have J = M−1 . However, as it is well known and clear from the above equations, one can recover the sources only up to an arbitrary permutation and up to a signscaling factor for each source. Nothing tells us which output should be equal to which source. One cannot distinguish Mσ from M0 σ 0 with M0 = MΛ and σ 0 = Λ−1 σ , where Λ is any diagonal matrix having nonzero diagonal elements. In particular, one can fix the absolute value of the arbitrary diagonal 1 matrix by asking for J to be the inverse of MK0 2 , which is equivalent to stating that all we can expect is to estimate normalized cumulants of the sources, such as the k-order normalized cumulants ζka : ζka ≡
hσak ic k/2
hσa2 ic
, a = 1, . . . , N.
(2.5)
Any solution J that one may find will thus be equal to the inverse of 1 MK0 2 up to what we will call a sign permutation, that is, the product of a permutation by a diagonal matrix with only ±1 diagonal elements. As it is usually done in the study of source separation, we have assumed that the number of sources is known (we have N observations—e.g., N captors, for N independent sources), and we assume M to be invertible. The difficulty comes from the fact that the statistics of the sources are not known; the mixture matrix is not known and is not necessarily (and in general it is not) an orthogonal matrix. 2.2 Reducing the Problem to Finding an Orthogonal Matrix. An elementary but extremely useful remark, first made by Comon (1994) and Cardoso (1989), is that the search for a solution J can be reduced to the search for an orthogonal matrix. Indeed, the J we are looking for has to
1426
Jean-Pierre Nadal and Nestor Parga
diagonalize the two-point connected correlation of the inputs. As is well known, the diagonalization of a symmetric positive matrix defines a matrix only up to an arbitrary orthogonal matrix. Hence, performing first whitening will leave us with an orthogonal matrix, which has to be determined from higher cumulants. Because we will make extensive use of this fact, and in order to introduce our notation, we give a detailed derivation. Let us say that J sets the variance output to the N × N identity matrix 1N : hhhT ic = 1N .
(2.6)
This reads, in term of the correlation C0 between the input data,
C0 ≡ hSST ic = MK0 MT
(2.7)
JC0 JT = 1N .
(2.8)
as
Keep in mind that C0 is what one can measure, and the right-hand side of equation 2.7 is its expression in function of the unknowns. Every solution of equation 2.8 can be written as J = Ω C0 −1/2 ,
(2.9)
where Ω is an orthogonal matrix:
Ω ΩT = 1N .
(2.10)
The inverse of the square root of C0 is defined from the PCA of the real positive matrix C0 . If we denote by J 0 the matrix whose rows are the orthonormal eigenvectors of C0 , and by Λ0 the diagonal matrix of the associated eigenvalues, one has
J 0 C0 J 0T = Λ0 J 0 J 0T = 1N
(2.11)
and −1
C0 −1/2 = J 0T Λ0 2 J 0 .
(2.12)
Now suppose that one first projects the input data onto the normalized principal components, that is, one computes h0 defined by −1
h0 = Λ0 2 J 0 S.
(2.13)
Redundancy Reduction and Independent Component Analysis
1427
Then, taking J as in equation 2.9 and replacing C0 −1/2 by its expression (see equation 2.12), h can be written as −1
h = ΩJ 0T Λ0 2 J 0 S = ΩJ 0T h0 .
(2.14)
Since O ≡ ΩJ 0T is again an orthogonal matrix, finding J means finding an orthogonal matrix O such that the output h = O h0
(2.15)
is a vector of N mutually independent components. This means equivalently that going from S to h0 leads to the problem of separating an orthogonal mixture. Let us check that h0 is indeed an orthogonal mixture of the sources. First we write, in terms of the unknowns, that the rows of J 0 are the eigenvectors of C0 :
J 0 MK0 MT J 0T = Λ0 . −1
(2.16) 1
This implies that Λ0 2 J 0 MK0 2 is an (unknown) orthogonal matrix, which, for later convenience, we define as the transposed of some orthogonal matrix O0 : −1
1
Λ0 2 J 0 MK0 2 = O0T O O 0
0T
(2.17)
= 1N .
Hence, the projected data h0 as defined in equation 2.13 can be expressed, in function of the unknown sources, as an orthogonal mixture: h0 (t) = O0T K 0 −1/2 σ (t).
(2.18)
Eventually, from equations 2.15 and 2.18, one can also write h in terms of the sources as h = X K 0 −1/2 σ ,
(2.19)
where X is the orthogonal matrix OO0T . Clearly, the desired solution is, up to a sign permutation, O0 for O or, equivalently, 1N for X. In reducing the source separation problem to the search for an orthogonal matrix, we have seen that one can choose different, although equivalent, formulations. Each may be specifically useful depending on the particular approach. In section 3, where we will determine the family of couplings J
1428
Jean-Pierre Nadal and Nestor Parga
solution of a given set of equations, it will be convenient to use equation 2.19, taking X as unknown. For the algebraic approach in section 4, we will use only the quantities related to the principal components, that is, h0 and O0 with the basic equation 2.18. And in section 6 we will consider adaptive algorithms for computing either J itself or O as defined in equation 2.15. Throughout this article, we will mainly consider the ideal situation where one would have access to the exact values of the cumulants. Nevertheless, we can make some general remarks about the practical cases, where cumulants are computed empirically (e.g., as time averages performed over some large enough time window T). In practice, all that is required implicitly is to have for the sources small enough cross-cumulants when these cumulants are defined from the chosen averaging procedure. This has to be true only for the cumulants needed to compute in a given approach. We will not address the problem of the accuracy of the solution as a function of the quality of the estimation of cumulants. In many sensory coding and data analysis problems, PCA is a natural thing to do (clearly for close to gaussian data, but also for much more complex cases, as in the analysis of time series). In all cases, one can go beyond the strict PCA, using the freedom in the choice of the orthogonal matrix Ω. An important example is in the application of redundancy reduction to the modeling of the first stages in the visual system (Linsker, 1988; Atick, 1992; Li & Atick, 1993). There, after whitening (in fact, an extension of whitening that takes noise processing into account), this freedom is used to satisfy as much as possible some additional constraints, such as locality of receptive fields and invariance by dilatation. As a result, one finds (Li & Atick, 1994) a block-diagonal orthogonal matrix, which leads to a multiscale representation of the visual scenes.
3 Criteria Based on Cross-Cumulants of Output Activities This section presents two new necessary and sufficient conditions for J to be a solution of the ICA. These conditions are that a given set of cumulants, associated with correlations at equal time, is set to zero. For comparison we will also give other conditions based on correlations at equal time and on time correlations related to known algebraic solutions. Eventually we will comment on the relationship and differences with other criteria proposed in the literature.
3.1 Condition on a Set of Nonsymmetric Higher Cumulants. The claim is that for J to be a solution, it is sufficient (and necessary) that J performs whitening and sets altogether to zero a given set of cross-cumulants of some given order k, the number of which being only of order N2 . More precisely, we have the following theorem:
Redundancy Reduction and Independent Component Analysis
1429
Theorem 1. Let k be an odd integer at least equal to 3 for which the k-cumulants of the sources, ζka , a = 1, . . . , N, are not identically null; then: (i) If at most one of these k-cumulants is null, J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above), if and only if one has: for every i, i0 , ½
hhi hi0 ic = δi,i0 hh(k−1) hi0 ic = 0 for i 6= i0 . i
(3.1)
where h is the output vector as defined in equation 2.4. (ii) If only 1 ≤ L ≤ N − 2 k-order cumulants are nonzero, then any solution J of equation 3.1 is the product of a sign permutation by a matrix that separates the L sources having nonzero k-cumulants, and such that the restriction 1 of J MK0 2 to the space of the N − L other sources is still an arbitrary (N − L) × (N − L) orthogonal matrix. hi0 ic Note that the second line in equation 3.1 means that the matrix hh(k−1) i is diagonal but the values of the diagonal terms, to be named below 1i , need not be known in advance. The detailed proof is given in appendix A. Here we give only a sketch of it, in the case where all the k-cumulants are nonzero. First, solving for the diagonalization of the second-order cumulant, one uses the representation in terms of orthogonal matrices as in section 2. As a result, we have seen that one can write h as h = X K 0 −1/2 σ , where X is the unknown orthogonal matrix. Replacing h by this expression in the higherorder cumulants, one then uses several times the fact that X is orthogonal. In particular, because we are considering for each pair i, i0 a cumulant that is linear in hi0 , one can obtain an equation for each Xi,a separately. As a result, for each pair i, a, either Xi,a is zero or factorizes into the product of two terms, one depending only on i and one depending only on a (namely, the inverse of the cumulant ζka ). From the condition that XXT has zero offdiagonal elements, one then gets that X has on each row and each column a unique nonzero element, and this implies that X is a sign permutation. In the case of N − L ≤ N null cumulants, for a typical solution L outputs will then appear as still correlated with one another, yet uncorrelated with the other N − L outputs. One then has to try another value of k, but only for the subspace of yet unseparated sources. As expected, the conditions of application of the theorem are not satisfied, for any k, in the case of gaussian sources—for which all the cumulants of order higher than 2 are zero, and nothing more than whitening can be done. For k even, one can easily find an example showing that the conditions (see equation 3.1) are not sufficient (see appendix A).
1430
Jean-Pierre Nadal and Nestor Parga
An interesting application of this theorem concerns the adaptive algorithm in Jutten and Herault (1991). In its simplest version, this algorithm aims at setting to zero the two-point correlation and the cross-cumulants hh2i hi0 ic for i 6= i0 . If the algorithm does reach that particular fixed point, theorem 1 asserts that full source separation has been obtained. It is not difficult to find other families of cumulants of a given order k for which a similar theorem will hold. We illustrate this by giving an analogous result for a set of cumulants involving more indices. Theorem 2. Let k and m be two integers with m at least equal to 2 and k strictly greater than m, for which the k-cumulants of the sources, ζka , a = 1, . . . , N, are not identically null; then (i) If at most one of these k-cumulants is null, J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above), if and only if one has: for every i, i0 , i00 , ½
hhi hi0 ic = δi,i0 hh(k−m) hm−1 hi00 ic = 0 for at least two nonidentical indices, i i0
(3.2)
where h is the output vector as defined in equation 2.4. (ii) If only 1 ≤ L ≤ N − 2 k-order cumulants are nonzero, then any solution J of equation 3.2 is the product of a sign permutation by a matrix that separates the L sources having nonzero k-cumulants and such that the restriction 1 of J MK0 2 to the space of the N − L other sources is still an arbitrary (N − L) × (N − L) orthogonal matrix. The proof is given in appendix B. In theorem 2, the cases k = m and m = 1 are excluded so that equation 3.2 never reduces to the conditions in equation 3.1 of theorem 1. In fact, these conditions are a subset of the conditions in equation 3.2, corresponding to the cases i = i0 . However, theorem 2 is not exhausted by theorem 1: no condition on the parity of k (the order of the cumulants) is required for the second theorem. Theorem 2 shows that only a subset of all the k-order cumulants with three indices is enough in order to obtain separation. The case m = 2 is related to the joint diagonalization approach of Cardoso and Souloumiac (1993), which we will discuss in section 4, where algebraic solutions are presented. 3.2 Conditions Related to Algebraic Solutions. In the above approach, one has N2 +N unknowns (the mixture matrix elements and the k-cumulants of the sources), and we use more equations, namely, N(N + 1) + N2 2
Redundancy Reduction and Independent Component Analysis
1431
equations in the case of Theorem 1. Coming back to our result, the counting argument suggests that one should be able to separate the sources with fewer conditions, that is exactly N2 + N. One possibility would have been that diagonalizing the second-order cumulants and the symmetric part of the k-order cumulants in equation 3.1 would do. This is not the case: taking N = 2, one can easily check that in doing so, unwanted solutions appear. However, looking at the derivation of Theorems 1 and 2, one sees that it results from the fact that we are using cumulants linear in at least one of the hi . One has thus to try cumulants that are symmetric in i, i0 and linear in both hi and hi0 . This appears to be easy, and one gets another family of sufficient conditions with exactly N2 + N conditions for N2 + N unknowns. An additional advantage is that because of the linearity in hi , hi0 , one can solve the equations algebraically. In fact, these conditions appear to be related to (or to the extension of) known algebraic solutions, which we review in section 4. There is a price to pay for working with the minimal number of equations: the restriction that we had in Theorems 1 and 2, namely, that of having nonzero k-order statistics for the sources is replaced here by the much more restricting condition of having different cumulants, as stated below. 3.2.1 Conditions on Symmetric Higher Cumulants. ditions based on correlations at equal time.
We first consider con-
Theorem 3. Let all the cumulants ζ4a , as defined in equation 2.5 for k = 4 be different from one another; then J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above), if and only if one has: for every i, i0 , ½ hhi hi0 ic = δi,i0 PN (3.3) 2 0 hhi i00 =1 (hi00 ) hi0 ic = 0 for i 6= i . where h = {hi , i = 1, . . . , N} is the output vector as defined in equation 2.4. If one or several sets of sources have a comon value for ζ4a (but different from one set to the other), then any J solution of equation 3.3 is the product of a sign permutation by a matrix that separates the sources having a distinct 4-cumulant, 1 and separate the sets globally. The restriction of J MK0 2 to the subspace of a given set is still an arbitrary orthogonal matrix. The elementary proof is short enough to be given here. As before, we first solve the diagonalization of the two-point correlation and express h in term of the unknown orthogonal matrix X. The higher-order cumulants in equation 3.3 can then be expressed as + ( ) * N X X X 2 2 (hi00 ) hi0 = Xi,a (Xi”,a ) (3.4) ζ4a Xi0 ,a . hi i00 =1
c
a
i”
1432
Jean-Pierre Nadal and Nestor Parga
But since X is orthogonal, * hi
N X
+ 2
(hi00 ) hi0
=
i00 =1
P
2 i” (Xi”,a )
X
= 1, and equation 3.4 reduces to
Xi,a ζ4a Xi0 ,a .
(3.5)
a
c
What we want is to impose * hi
N X
+ 2
(hi00 ) hi0
i00 =1
= 1i δi,i0
(3.6)
c
where the 1i are yet indeterminate constants. Comparing equations 3.5 and 3.6, one sees that X gives the eigendecomposition of a matrix, which is in fact already diagonal. This implies that if all the eigenvalues, that is the ζ4a , are distinct, X is a sign permutation. Otherwise X is, up to a sign permutation, the identity matrix on the subspace corresponding to the distinct eigenvalues, and an arbitrary `×` orthogonal matrix on each subspace associated with an eigenvalue of degeneracy `. The algebraic solution associated with Theorem 3 was proposed by Cardoso (1989) (see section 4.3). P Finally, we note that in the above statement, one may replace i h2i by any Q(h) being a scalar depending polynomially on h (and by extension any analytical function of h), such that for any orthogonal matrix X, denoting its rows by Xa , a = 1, . . . , N, the N numbers hQ(σa Xa )σa2 ic , a = 1, . . . , N are distinct (here every σa should be understood as the normalized source √ σa 2 ). In practice, it may not be easy to prove that such a condition holds, hσa ic P even for the simplest cases, say, Q(h) = i hki with k even at least equal to 4. 3.2.2 Conditions on Time Correlations. We consider now conditions related to algebraic solutions based on time correlations (F´ety, 1988; Tong et al., 1990; Belouchrani, 1993; Molgedey & Schuster, 1994). These are extremely simple. If the sources present time correlations, namely, the two-point correlation matrix K(τ ) for some time delay τ > 0, K(τ )a,b ≡ hσa (t) σb (t − τ )ic
(3.7)
has nonzero diagonal elements: K(τ )a,b = δa,b Ka (τ ),
(3.8)
then one can state: Theorem 4. Let all the cumulants Ka (τ ), as defined in equation 3.8 be different from one another; then J is equal to the inverse of M (up to a sign permutation and a rescaling, as explained above) if and only if one has:
Redundancy Reduction and Independent Component Analysis
1433
for every i, i0 , ½
hhi (t)hi0 (t)ic = δi,i0 hhi (t) hi0 (t − τ )ic = 0 for i 6= i0
(3.9)
where h = {hi , i = 1, . . . , N} is the output vector as defined in equation 2.4. If one or several sets of sources have a comon value for Ka (τ ) (but different from one set to the other), then any J solution of equation 3.9 is the product of a sign permutation by a matrix that separates the sources having a distinct Ka (τ )1 cumulant, and separate the sets globally. The restriction of J MK0 2 to the subspace of a given set is still an arbitrary orthogonal matrix. The proof is essentially the same as the one of Theorem 3. One writes hhi (t) hi0 (t − τ )ic =
X
Xi,a Ka (τ ) Xi0 ,a .
(3.10)
a
What we want is to impose hhi (t) hi0 (t − τ )ic = 1i δi,i0
(3.11)
where the 1i are yet indeterminate constants. Equations 3.10 and 3.11 play the same role as equations 3.5 and 3.6 in the case of Theorem 3, and the rest of the proof follows in the same way. Averages are conveniently computed as time averages over a time window of some size T. In general, it could happen that the two-point correlation Ka (τ ) is a function of the particular time window [t − T, t] considered. However, if, on that time window, the sources are sufficiently independent (Ka (τ ) is diagonal), since the mixture matrix does not depend on time, the solution J obtained from the data collected on that single time window is an exact, time-independent solution. 3.3 Comparison with Other Criteria. To conclude this section, we comment briefly on other criteria, namely, those proposed by Comon (1994). This author has shown that ICA is obtained from maximizing the sum of the square of k-order cumulants, for any chosen k > 2. That is, it is sufficient to find an (absolute) minimum of
C≡−
N X h(hi )k i2c
(3.12)
i=1
for any given k > 2, after whitening has been performed. This is a nice result since it involves only N cumulants. However, defining a cost function is not exactly the same as giving explicit conditions to the cumulants (although
1434
Jean-Pierre Nadal and Nestor Parga
one can easily derive a cost function from these conditions, as we will do in section 6). In fact, the minimization of a cost function such as equation 3.12 does involve implicitly the computation of order N2 cumulants, as it should (see the counting argument at the begining of this subsection). This can be seen in two ways. First, since the value of the k-cumulants at the minimum is not known, the minimization of C given in equation 3.12 is not equivalent to a set of equations for these cumulants; then, if one wants to perform a gradient descent, one has to take the derivative of the cost function with respect to the couplings J, and this will generate cross cumulants. It is the resulting fixed-point equations that then matter in the counting argument (we will come back to the algorithmic aspects in section 6). Second, Comon showed also that the minimization of equation 3.12 is equivalent to setting to zero all the nondiagonal cumulants of the same order k (still in addition to whitening): hhi1 hi2 , . . . , hik ic = 0, for every set of k nonidentical indices.
(3.13)
Hence, what we have obtained is that only a small subset of these cumulants has to be considered. 4 Algebraic Solutions In this section we present four families of algebraic solutions. We point out their advantages and drawbacks insisting on their simplicity when formulated within the framework described in section 2 and relating them to the results of section 3. 4.1 Using Time-Delayed Correlations. In the case where each source σa shows time correlations, it has been shown (F´ety, 1988; Tong et al., 1990; Belouchrani, 1993; Molgedey & Schuster, 1994) that there is a simple algebraic solution using only second-order cumulants. More precisely, let us assume that the two-point correlation matrix K(τ ) for some time delay τ > 0, K(τ )a,b ≡ hσa (t) σb (t − τ )ic
(4.1)
has nonzero diagonal elements: K(τ )a,b = δa,b Ka (τ ).
(4.2)
It follows (Molgedey & Schuster, 1994) that source separation is obtained by asking for the rows of J to be the left eigenvectors of the following nonsymmetric matrix C(τ ) C0 −1 ,
(4.3)
Redundancy Reduction and Independent Component Analysis
1435
where C(τ ) is the second-order cumulant of the inputs at time delay τ : C(τ ) = hS(t) ST (t − τ )ic .
(4.4)
The equivalent, but more natural, aproach of F´ety (1988) and Tong et al. (1990) is to work with a symmetric matrix, making use of the reduction to the search for an orthogonal matrix presented in section 2. Indeed, let us first perform whitening (note that this implies the resolution of the eigenproblem for C0 , a task of the same complexity as the computation of its inverse, which is needed in equation 4.3). We then compute the correlations at time delay τ of the projected inputs h0 (as given by equation 2.13), that is the matrix hh0 (t) h0T (t − τ )ic . Using the expression equation 2.18 of h0 as a function of the sources, one sees that this matrix is in fact symmetric and given by: hh0 (t) h0T (t − τ )ic = O0T K 0 −1/2 K (τ ) K 0 −1/2 O0 .
(4.5)
This shows that the desired orthogonal matrix is obtained from solving the eigendecomposition of this correlation matrix (see equation 4.5). Remark. In thisRsection, averages might be conveniently time averages, such as hA(t)i = dt0 A(t − t0 ) exp(−t0 /T). In that case, τ has to be small compared to T in such a way that, for example, K0 is the same when averaging on t0 < t and on t0 < t − τ . 4.2 Using Correlations at Equal Time. Shraiman (1993) has shown how to reduce the problem of source separation to the diagonalization of a certain symmetric matrix D built on third-order cumulants. We refer readers to Shraiman (1993) for the elegant derivation of this result. Here we will show how to derive this matrix from the approach introduced in sections 2.2 and 3. We will do so by working with the k-order cumulants, giving a generalization of Shraiman’s work to any k at least equal to 3. We start with the expression of the data projected onto the principal components as in equation 2.18: h0 (t) = O0T K 0 −1/2 σ (t).
(4.6)
One computes the k-order statistics hh0i1 h0i2 , . . . , h0ik ic . From equation 4.6, these cumulants have the following expression in term of source cumulants: hh0i1 h0i2 , . . . , h0ik ic =
N X
0 0 Oa,i , . . . , Oa,i ζ a, 1 k k
(4.7)
a=1
where the ζka are the normalized cumulants as defined in equation 2.5. Now, one multiplies two such cumulants having k − 1 identical indices and sums
1436
Jean-Pierre Nadal and Nestor Parga
over all the possible values of these indices. This will produce the contraction of k−1 matrices O0 with their transposed, leading to 1N , and only two terms O0 will remain. Explicitly, we consider the symmetric matrix, to be referred to as the k-Shraiman matrix,
Di,i0 =
X i1 ,i2 ,...,ik−1
hh0i h0i1 , . . . , h0ik−1 ic hh0i0 h0i1 , . . . , h0ik−1 ic ,
(4.8)
which, from equation 4.7, is equal to
Di,i0 =
N X
0 0 a 2 Oa,i Oa,i 0 (ζk ) .
(4.9)
a=1
The above formula is nothing but an eigendecomposition of the matrix D. This shows that the rows of O0 are the eigenvectors of the k-Shraiman matrix, the eigenvalues being (ζka )2 for a = 1, . . . , N. A solution of the source separation problem is thus given by the diagonalization of one k-Shraiman matrix (e.g., taking k = 3 or k = 4). 4.3 A Simple Solution Using Fourth-Order Cumulants. We now consider the solution based on fourth-order cumulants (Cardoso, 1989), which is directly related to the results obtained in section 3.2.1. Let us consider the following cumulants of the input data projected onto the principal components, h0 : * C4 i,i0 ≡
h0i
N X i00 =1
+ 0 h02 i00 hi0
.
(4.10)
c
In term of the orthogonal matrix O0 and of the cumulants ζ4a , it reads (C4 )i,i0 =
N X
0 a Oa,i ζ4
a=1
N X
0 0 (Oa,i” )2 Oa,i 0.
(4.11)
i00 =1
Since O0 is orthogonal, this reduces to the equation
C4 = O0 K4 O0T ,
(4.12)
where K4 is the diagonal matrix of the fourth-order cumulants, (K4 )a,b = δa,b ζ4a .
(4.13)
This shows that O0 can be found by solving for the eigendecomposition of the cumulant C4 .
Redundancy Reduction and Independent Component Analysis
1437
All of the algebraic solutions considered thus far in this section are based on the same two facts: (1) the diagonalization of a real positive symmetric matrix leaves an arbitrary orthogonal matrix, which can be used to diagonalize another symmetric matrix; (2) but because one is dealing with a linear mixture, applying this to two well-chosen correlation matrices is precisely enough to solve the BSS problem (in particular, we have seen that working with two symmetric matrices provides at least as many equations as unknowns). 4.4 The Joint Diagonalization Approach. All of the algebraic solutions discussed so far suffer from the same drawback, which is that sources having the same statistics at the orders under consideration will not be separated. Moreover, numerical instability may occur if these statistics are different but very close to one another. One may wonder whether it would be possible to work with an algebraic solution involving more equations than unknowns, in such a way that indetermination cannot occur. A positive answer is given by the joint diagonalization method of Cardoso and Souloumiac (1993) and Belouchrani (1993). Here we give a different (and slightly more general) presentation from the one in Cardoso and Souloumiac (1993). In particular, we will make use of the theorems of section 3.1. We will consider only correlations at equal time. The case of time correlations is discussed in Belouchrani (1993). The basic idea is to joint diagonalize a family of matrices Γr r 0α,β = hh0α h0β Qr (h0 )ic ,
(4.14)
where h0 is the principal component vector as defined in equation 2.13 and the Qr are well-chosen scalar functions of it, the index r labeling the function (hence the matrix) in the family. One possible example is the family defined by taking for r all possible choices of k − 2 indices with k ≥ 3, r ≡ (α1 , . . . , αk−2 ), 1 ≤ α1 ≤ α2 , . . . , ≤ αk−2 ≤ N
(4.15)
Qr=(α1 ,...,αk−2 ) (h0 ) = h0α1 , . . . , h0αk−2 .
(4.16)
and
The case considered in Cardoso and Souloumiac (1993) is k = 4. Using the expression of h0 as a function of the normalized sources, h0 (t) = O0T K 0 −1/2 σ (t),
(4.17)
where O0 is the orthogonal matrix that we want to compute (see section 2.2), one can write
Γr = O0T Λr O0 ,
(4.18)
1438
Jean-Pierre Nadal and Nestor Parga
where Λr is a diagonal matrix with components 0 0 3ra = ζka Oa,α , . . . , Oa,α . 1 k−2
(4.19)
As it is obvious in equation 4.18, the matrices Γr are jointly diagonalizable by the orthogonal matrix O0 . However, if for at least a pair a, b one has 3ra = 3rb for every r, O0 is not the only solution (up to a sign permutation). Actually this never happens, as shown in Cardoso and Souloumiac (1993) for k = 4. Let us give a direct proof valid for any k. We consider one particular orthogonal matrix O, which jointly diagonalizes all the matrices of the family, and let h = Oh0 . By hypothesis, the matrix OΓr OT is diagonal, that is hhi hi0 h0α1 , . . . , h0αk−2 ic = 0 for i 6= i0 ,
(4.20)
and this for every choice of the k − 2 indices. Multiplying the left-hand side by Oi1 ,α1 , . . . , Oik−2 ,αk−2 and summing over the Greek indices, one gets that for any choice of i1 , . . . , ik−2 , hhi hi0 hi1 , . . . , hik−2 ic = 0 for i 6= i0 .
(4.21)
We can now make use of Theorem 2: one can write equation 4.21 for the particular choice i1 = i2 = . . . = ik−2 = i0 , which gives exactly the conditions (see equation 3.2) for k and m = 2, and we can thus apply Theorem 2 (equivalently, one can deduce from equation 4.21 that these cumulants are zero whenever any two indices are different, and then use equation 3.13, that is, Comon’s 1994 result). For the simplest case k = 3, these conditions are exactly those of Theorem 2: joint diagonalizing the N matrices Γα = hh0 h0T h0α ic is strictly equivalent to imposing the conditions (see equation 3.2) for k = 3 (and m = 2) (note, however, that the number of conditions is larger than the minimum required according to Theorem 1). For k > 3, the number of conditions in equation 4.21 is larger that the number of conditions in equation 3.2. To conclude, one sees that at the price of having a number of conditions larger than the minimum required (in order to guarantee that no indetermination will occur), source separation can be done with an algebraic method, even when there are identical source cumulants. Remark. In practical applications, cumulants are empirically computed, and thus the matrices under consideration are not jointly diagonalizable. For this reason, a criterion is considered in Cardoso and Souloumiac (1993) that, if maximized, provides the best possible approximation to joint diagonalization. In this article, we do not consider this aspect of the problem.
Redundancy Reduction and Independent Component Analysis
1439
5 Cost Functions Derived from Information Theory We switch now to the study of adaptive algorithms. To do so, one first has to define proper cost functions. Whereas in the next section we will consider cost functions based on cumulants, here we consider the particular costs derived from information theory. In both cases we will take advantage of the results obtained in section 3. We will see that an important outcome is the derivation of updating rules for the synaptic efficacies closely related to the Bienenstock, Cooper, and Munro (BCM) theory of cortical plasticity (Bienenstock, Cooper, & Munro, 1982). 5.1 From Infomax to Redundancy Reduction. Our starting point is the main result obtained in Nadal and Parga (1994), namely, that maximization of the mutual information between the input data and the output (neural code) leads to redundancy reduction, hence to source separation for both linear and nonlinear mixtures. To be more specific, we first give a short derivation of that fact (for more details see Nadal & Parga, 1994). We consider a network with N inputs and p outputs, and nonlinear transfer functions fi , i = 1, . . . , p. Hence the output V is given by a gain control after some (linear or nonlinear) processing: Vi (t) = fi (hi (t)), i = 1, . . . , p.
(5.1)
In the simplest case (in particular, in the context of BSS), h is given by the linear combination of the inputs: hi (t) =
N X
Ji,j Sj (t), i = 1, . . . , p.
(5.2)
j=1
However, here the hi (t) can be as well any deterministic (hence not necessarily linear) functions of the inputs S(t). In particular, we will make use of this fact in section 7: there, hi will be the local field at the output layer of a one-hidden-layer network with nonlinear transfer functions. The mutual information I between the input and the output is given by (see Blahut, 1988): Z P(V, S) . (5.3) I ≡ dp V dN SP(V, S) log Q(V) P(S) This quantity is well defined only if noise processing is taken into account (e.g., resolution noise). In the limit of vanishing additive noise, one gets that maximizing the mutual information is equivalent to maximizing the (differential) output entropy H(Q) of the output distribution Q = Q(V), Z (5.4) H(Q) = − dp V Q(V) log Q(V).
1440
Jean-Pierre Nadal and Nestor Parga
On the right-hand side of equation 5.4, one can make the change of variable V → h, using p Y
dVi Q(V) =
p Y
dhi 9(h)
(5.5)
i = 1, . . . , p.
(5.6)
9(h) . dh9(h) ln Qp 0 i=1 fi (hi )
(5.7)
i=1
i=1
and dVi = fi0 (hi )dhi , This gives Z H(Q) = −
This implies that H(Q), hence I , is maximal when 9(h) factorizes, 9(h) =
p Y
9i (hi ),
(5.8)
i=1
and at the same time for each output neuron, the transfer function fi has its derivative equal to the corresponding marginal probability distribution: fi0 (hi ) = 9i (hi ),
i = 1, . . . , p.
(5.9)
As a result, infomax implies redundancy reduction. The optimal neural representation is a factorial code—provided it exists. 5.2 The Specific Case of BSS. Let us now return to the BSS problem for which the h are taken as linear combinations of the inputs. By hypothesis, the N-dimensional input is a linear mixture of N-independent sources. In the following we consider only p = N. Note that the factorial code is obtained by the network processing before applying the nonlinear function at each output neuron. From the algorithmic aspect, as suggested in Nadal and Parga (1994), this gives us two possible approaches. One is to optimize globally, that is, to maximize the mutual information over both the synaptic efficacies and the transfer functions. In that case, infomax is used in order to perform ICA—the nonlinear transfer functions being there just to enforce factorization. Another possibility is first to find the synaptic efficacies leading to a factorial code, and then compute the optimal transfer functions (which depend on the statistical properties of the stimuli). In that case, one may say that it is ICA that is used in order to build the network that maximizes information transfer. Still, if one considers that the transfer functions are chosen at each
Redundancy Reduction and Independent Component Analysis
1441
instant of time according to equation 5.9, the mutual information becomes precisely equal to minus the redundancy cost function R: Z
R≡
dh9(h) ln Qp
9(h)
i=1 9i (hi )
.
(5.10)
In the context of blind source separation the relevance of the redundancy cost function has been recognized by Comon (1994) (we also note a work by Burel [1992], where a different but related cost function is considered). Remark on the terminology. The quantity (see equation 5.10), called here and in the literature related to sensory coding the redundancy, is called in the signal processing literature (in particular in Comon, 1994) the mutual information, short for the mutual information between the random variables hi (the outputs). But this mutual information, that is the redundancy (see equation 5.10), should not be mistaken for the mutual information (see equation 5.3) we introduced above, which is defined in the usual way—that is, between the input and the output of a processing channel (Blahut, 1988). To avoid confusion, we will consistently use redundancy for equation 5.10, and mutual information for equation 5.3. Although it is appealing to work with either the mutual information or the redundancy, doing so may not be easy. It is convenient to rewrite the output entropy, changing the variable h → S in equation 5.7, as done in Bell and Sejnowski (1995). Since the input entropy H(P) is a constant (it does not depend on the couplings J), the quantity that has to be maximized is
E = ln |J| +
X hlog ci (hi )i,
(5.11)
i
where |J| is the absolute value of the determinant of the coupling matrix J and h.i is the average over the output activity hi . The function ci can be given two interpretations: it is equal to either fi0 , if one considers the mutual information, or to 9i , if one considers the redundancy (the mutual information for the optimal transfer function at a given J). In the first case, one has to find an algorithm for searching for the optimal transfer functions; in the second case, one has to estimate the marginal distributions. The cost (see equation 5.11) can be given another interpretation. In fact, it was first derived in a maximum likelihood approach (Gaeta & Lacoume, 1990; Pham, Garrat, & Jutten, 1992): it is easy to see that equation 5.11, with ci = 9i , is equal to the (average of) the log likelihood of the observed data (the inputs S), given that they have been generated as a linear combination of independent sources with the 9i as marginal distributions. In the two following subsections we consider practical approaches.
1442
Jean-Pierre Nadal and Nestor Parga
5.3 Working with a Given Family of Transfer Functions. We have seen that at the end of the optimization process, the transfer functions will be related to the probability distributions of the independent sources. Since these distributions are not known—and cannot be estimated without first performing source separation—the choice of a proper parameterized family may be a problem. Still, any prior knowledge on the sources and any reasonable assumption may be used to limit the search to a family of functions controlled by a small number of parameters. A practical way to search for the best fi0 is to restrict the search to an a priori chosen family of transfer functions. In Pham et al. (1992), a practical algorithm is proposed, based on a particular choice combined with an expansion of the cost close to a solution. Another, and very simple, strategy has been tried in Bell and Sejnowski (1995), where very promising results have been obtained on some specific applications. Their numerical simulations suggest that one can take transfer functions with a simple behavior (that is, for example, with one peak in the derivative when the data show only one peak in their distribution), and to optimize just the gain and the threshold in each transfer function, which means fitting the location of the peak and its height.
5.4 Cumulant Expansion of the Marginal Distributions. When working with the redundancy, one would like to avoid having to estimate the marginal distributions from histograms, since this would take a lot of time. One may parameterize the marginal probability distributions and adapt the parameters at the same time one is adapting the couplings. This is exactly the same as working with the mutual information with a parameterized family of transfer functions. Another possibility, considered in Gaeta and Lacoume (1990) and Comon (1994), is to replace each marginal by a simple probability distribution with the same first cumulants as the ones of the actual distribution. Recently this approach has been used in Amari et al. (1996). We consider this expansion here with a slightly different point of view in order to relate this approach to the results of section 3. We know that if we had gaussian distributions, every required computation would be easy. Now, if N is large, each field hi is a sum of a large number of random variables, so that before adaptation (that is, with arbitrary synaptic efficacies), the marginal distribution for hi is a gaussian. However, through adaptation, each hi becomes proportional to one source σα —whose distribution is in general not a gaussian, and not necessarily close to gaussian. Still, there is another, and stronger, motivation for considering such an approximation. Indeed, the result of section 3, which is that conditions on a limited set of cumulants are sufficient in order to obtain factorization, strongly suggests replacing the unknown distribution with a simple distribution having the same first cumulants up to some given order. Let us consider the systematic close-to-gaussian cumulant expansion of
Redundancy Reduction and Independent Component Analysis
1443
9i (hi ) (Abramowitz & Stegun, 1972). At first nontrivial order, it is given by ¸ · hi (h2i − 3) 9i (hi ) ≈ 9i1 (hi ) ≡ 9 0 (hi ) 1 + λ(3) , i 6
(5.12)
where 9 0 (hi ) is the normal distribution µ 2¶ h 1 9 0 (hi ) ≡ √ exp − i , 2 2π
(5.13)
and λ(3) i is the third (true) cumulant of hi : 3 λ(3) i ≡ hhi ic .
(5.14)
In expression 5.12, we have taken into account that, as explained in section 2, one can always take hhi i = 0.
(5.15)
hh2i ic = 1.
(5.16)
and
In the cost function (see equation 5.11), that is,
E = ln |J| +
XZ
dhi 9i (hi ) log 9i (hi ),
(5.17)
i
we replace 9i (hi ) by 9i1 (hi ), and expand the logarithm · ¸ hi (h2i − 3) ln 1 + λ(3) . i 6 Then the quantity to be maximized is, up to a constant,
E = ln |J| +
N 1X [λ(3) ]2 . 6 i=1 i
(5.18)
Since optimization has to be done under the constraints in equation 5.16, we add Lagrange multipliers:
E (ρ) = E −
N X 1 i=1
2
ρi (hh2i ic − 1).
(5.19)
1444
Jean-Pierre Nadal and Nestor Parga
Taking into account equation 5.15, one then obtains the updating equation for a given synaptic efficacy Jij : 1Jij ∝ −
dE (ρ) dJij
dE (ρ) = −JijT −1 − hh3i ic h(h2i − 1)Sj i + ρi hhi Sj i. dJij
(5.20)
We now consider the fixed-point equation, that is, 1Jij = 0. Multiplying by Ji0 j and summing over j, it reads: δii0 = ρi hhi hi0 ic − hh3i ic hh2i hi0 ic
(5.21)
together with hh2i ic = 1 for every i. The parameters ρi are obtained by writing the fixed-point equation (5.21) at i = i0 , that is, ρi = 1 + hh3i i2c .
(5.22)
Note that, in particular, ρi > 0 for all i. It follows from the result (see equation 3.1) of section 3 that the exact, desired solutions are particular solutions of the fixed-point equation (5.21), giving a particular absolute minimum of the cost function with the closeto-gaussian approximation. However, there is no guarantee that no other local minimum exists: there could be solutions for which equation 5.21 is satisfied with nondiagonal matrices hhi hi0 ic and hh2i hi0 ic . Remark. One may wonder what happens if one first performs whitening, computing the h0 , and then uses the mutual information between h0 and h. This is what is studied in Comon (1994), where at lowest order, the cost function (see equation 5.18) is found to be the sum of the square of the third cumulants. This can be readily seen from equation 5.18, where J is now the orthogonal matrix that takes h0 to h, and thus ln |J| is a constant. 5.5 Link with the BCM Theory of Synaptic Plasticity. Let us now consider a possible stochastic implementation of the gradient descent (see equation 5.20). Since there are products of averages, it is not possible to have a simple stochastic version, where the variation of Jij would depend on the instanteneous activities only. Still, by removing one of the averages in equation 5.20 one gets the following updating rule: 1Jij = ²{JijT −1 − ρi hi Sj + hh3i ic h2i Sj },
(5.23)
where ² is a parameter controlling the rate of variation of the synaptic efficacies. The parameters ρi can be taken at each time according to the fixed-point
Redundancy Reduction and Independent Component Analysis
1445
equation (5.22). It is quite interesting to compare the updating equation (5.23) with the BCM theory of synaptic plasticity (Bienenstock et al., 1982). In the latter, a qualitative synaptic modification rule was proposed in order to account for experimental data on neural cell development in the early visual system. This BCM rule can be seen as a nonlinear variant of the Hebbian covariance rule. One of its possible implementations reads, in our notation: 1Jij = ² γi {−hi Sj + 2i h2i Sj },
(5.24)
where γi and 2i are parameters possibly depending on the current statistics of the cell activities. The particular choices 2i = hh2i i, γi = 1 or 2−1 i have been studied with some detail (Intrator & Cooper, 1992; Law & Cooper, 1994). The two main features of the BCM rule are: (1) there is a synaptic increase or decrease depending on the postsynaptic activity relative to some threshold 2i , which itself varies according to the cell mean activity level; (2) at low activity, there is no synaptic modification. Since ρi is positive, the rule we derived above is quite similar to equation 5.24, with a threshold 2i equal to hh3i ic . ρi The main difference is in the constant (that is, activity-independent) term JijT −1 . This term plays a crucial role: it couples the N neural cells. Note in fact that the BCM rule has been mostly studied for a single cell, and only crude studies of its possible extension to several cells have been performed (Scofield & Cooper, 1985). Note also that in our formulas we have always assumed zero mean activity (hSj i = 0, hence also hhi i = 0). If this were not the case, the corresponding averages have to be subtracted (Sj → Sj − hSj i for every j, hi → hi − hhi i for every i). Finally we note that if the third cumulants are zero, one has to make the expansion up to the fourth order. The corresponding derivation and the conclusions are similar: apart from numerical factors, essentially the square of the third cumulant in the cost is replaced by the square of the fourth cumulant, and one gets again a plasticity rule similar to equation 5.23, that is with the same qualitative behavior as the BCM rule. 6 Adaptive Algorithms from Cost Functions Based on Cumulants 6.1 A Gradient Descent Based on Theorem 1. Among the algorithms using correlations at equal time, only algebraic solutions (discussed in section 4) and the recently proposed deflation algorithm (Delfosse & Loubaton, 1995), which extracts the independent components one by one, guarantee to find them in a rather simple and efficient way. All other approaches suffer from the same problem: empirical updating rules based on high moments, like the Herault-Jutten algorithm, and gradient methods based on some cost
1446
Jean-Pierre Nadal and Nestor Parga
function (most often a combination of cumulants), may have unwanted fixed points (see Comon, 1994; Delfosse & Loubaton, 1995). Whatever the algorithm one is working with, the conditions derived in section 3 can be used in order to check whether a correct solution has been found. Clearly one can also define cost functions from these conditions by taking the sum of the square of every cumulant that has to be set to zero. We thus have a cost function for which, in addition to having only good solutions as absolute minima, the value of the cost at an absolute minimum is known: it is zero. Of course, many other families of cumulants could be used for the same purpose. The possible interest of the one we are dealing with is that it involves a small number of terms. However, this does not imply a priori any particular advantage as far as efficiency is concerned. For illustrative purposes, we consider with more detail a gradient descent for a particular choice of cost based on Theorem 1 in section 3. Specifically, we ask for the diagonalization of the two-point correlation and the thirdorder cumulants hhi h2i0 ic . Here again we use the reduction to the search for an orthogonal transformation, as explained in section 2. We thus consider the optimization of the orthogonal matrix. The cost is then defined by
E=
1 X 1 h hi h2i0 i2c − Tr[ρ(OOT − 1N )] 2 i6=i0 2
(6.1)
where ρ is a symmetric matrix of Lagrange multipliers, and hi has to be written in term of O (see equation 2.15): hi =
N X α=1
Oi,α h0α ,
(6.2)
the h0i being the projections of the inputs onto the principal components, as given by equation 2.13. The simplest gradient descent scheme is given by dE Oi,α = −ε , dt dOi,α
(6.3)
where ε is some small parameter. From equation 6.1, one derives the derivative of E with respect to Oi,α : dE dOi,α
=
Xh
hhi h2i0 ic h h0α h2i0 ic + 2hhi0 h2i ic hhi0 hi h0α ic
i0 (6=i)
−
X i0
ρi,i0 Oi0 ,α .
i
(6.4)
Redundancy Reduction and Independent Component Analysis
1447
One can either adapt ρ according to dρ dE = −ε , dt dρ or choose ρ imposing at each time OOT = 1N , which we do here. The equation for ρ is obtained by writing the orthonogonality condition for O, (O + dO)(OT + dOT ) = 1N , that is:
O dOT + dO OT = 0.
(6.5)
Paying attention to the fact that ρ is symmetric, one gets ρi,i0 =
i 1 Xh hhi h2k ic hhi0 h2k ic + 2hhk h2i ic hhk hi hi0 ic 2 k(6=i0 ) i 1 Xh hhi h2k ic hhi0 h2k ic + 2hhk h2i0 ic hhk hi hi0 ic . + 2 k(6=i)
(6.6)
Replacing in equation 6.4 ρ by its expression (6.6), multiplying both sides of equation 6.4 by Oi0 ,α for some i0 and summing over α, and using equation 6.2, one gets the rather simple equations for the projections of the variations of O onto the N vectors Oi : X α
Oi0 ,α
X dOi,α dE = −ε Oi0 ,α ≡ −ε ηi,i0 , dt d O i,α α
(6.7)
where: ηi,i0 =
i 3h 3 hhi0 ic hhi h2i0 ic − hh3i ic hhi0 h2i ic 2 h i X hhk hi hi0 ic hhk h2i ic − hhk h2i0 ic . +
(6.8)
k
From the above expression, one can easily write the (less simple) updating equations for either O (multiplying by Oi0 ,α and summing over i0 ), or J −1
(multiplying by [OΛ0 2 O0 ]i0 ,j and suming over i0 ). Note that the Lagrange multiplier ρ ensures that, starting from an arbitrary orthogonal matrix O(0) at time 0, O(t) remains orthogonal. In practice, since this orthogonality is enforced only at first order in ε, an explicit normalization will have to be done from time to time This is an efficient method used in statistical mechanics and field theory (Aldazabal, Gonzalez-Arroyo, & Parga, 1985). Although we derived the updating equation from a global cost function, one may also derive an adaptative version. One possibility is to use the
1448
Jean-Pierre Nadal and Nestor Parga
approach considered in the preceding section: in each term containing a product of averages, one average h.i is replaced by the instantaneous value. The remaining average is computed as a time average on a moving time window. An alternative approach is to replace each average by a time average, taking different time constants in order to obtain an estimate of the product of averages—and not the average of the product. 6.2 From Feedforward to Lateral Connections. We conclude with a general remark concerning the choice of the architecture. In all the above derivations, we worked with a feedforward network, with no lateral connections. As it is well known, one may prefer to work with adaptable lateral connections, as it is the case in the Herault-Jutten algorithm (Jutten & Herault, 1991). One can in fact perform any given linear processing with either one or the other architecture; it is only the algorithmic implementation that might be simpler with a given architecture. Let us consider here this equivalence. A standard way to use a network with lateral connections is the one in Jutten and Herault (1991). One has a unique link from each input Si to the output unit i and lateral connections L between output units. The dynamics of the postsynaptic potentials ui of the output cells is given by X dui =− Li,i0 ui0 + Si . dt i0
(6.9)
If L has positive eigenvalues, then the dynamics converge to a fixed point h given by L h = S. As a result, after convergence, the network gives the same linear processing as the one with feedforward connections J given by
L = J−1 .
(6.10)
In the particular case considered above, one gets the updating rule for Lj,i0 = 1
−1 2 0 Jj,i 0 by multiplying equation 6.7 by [OΛ0 O ]i,j and summing over i. For a given adaptive algorithm derived from the minimization of a cost function, one can thus work with either the feedforward or the lateral connections. It is clear that, in general, updating rules will look different whether they are for the lateral or feedforward connections. However, it is worth mentioning that if we consider the updating rule after whitening (which is to assume that a first network is performing PCA, providing the h0α as input to the next layer), then the updating rule for the feedforward and the lateral connections is essentially the same. Indeed, the feedforward coupling matrix that allows going from h0 to h is an orthogonal transformation O; hence the associated lateral network has as couplings the inverse of that orthogonal transformation, that is, its transposed, OT .
Redundancy Reduction and Independent Component Analysis
1449
7 Possible Extensions to Nonlinear Processing Some work has already proposed redundancy reduction criteria for nonlinear data processing (Nadal & Parga, 1994; Haft, Schlang, & Deco, 1995; Parra, 1996), and for defining unsupervised algorithms in the context of automatic data clustering (Hinton, Dayan, Frey, & Neal, 1995). Here we just point out that all the criteria and cost functions discussed in this article may be applied to the output layer of a multilayer network for performing independent component analysis on nonlinear data. Indeed, if a multilayer network is able to perform ICA, this implies that in the layer preceding the output, the data representation is a linear mixture. The main questions are, then, How many layers are required in order to find such a linear representation? and Is it always possible to do so? Assuming that there exists at least one (possibly nonlinear) transformation of the data leading to a set of independent features, we suggest two lines of research. The first is based on general results on function approximation. It is known that a network with one hidden layer with sufficiently many units is able to approximate a given function with any desired accuracy (provided sufficiently many examples are available) (Cybenko, 1989; Hornik, Stinchcombe, & White, 1991; Barron, 1993). Then there exists a network with one hidden layer and N outputs such that the ith output unit gives an approximation of the particular function that extracts the ith independent component from the data. Hence, we know that it should be enough to take a network with one hidden layer. A possible approach is to perform gradient descent onto a cost function defined for the output layer, which, if minimized, means that separation has been achieved (we know that the redundancy will do). If the algorithm does not give good results, then one may increase the number of hidden units. Another approach is suggested by the study of the infomax-redundancy reduction criteria (Nadal & Parga, 1994). It is easy to see that the cost function (see equation 5.11) has a straightforward generalization to a multilayer network where every layer has the same number N of units. Indeed, if one calls Jk the couplings in the kth layer, and cki (hki ) the derivative of the transfer function (or the marginal distribution; see section 5.2) of the ith neuron in the kth layer, the mutual information between the input and the output of the multilayer network can be written as
EL =
X k
ln |Jk | +
X X k
hlog cki (hki )i.
(7.1)
i
Hence the cost EL is a sum of terms, each tending to impose factorization in a given layer. This allows an easy implementation of a gradient descent algorithm. Moreover, this additive structure of the cost suggests a constructive approach. One may start with one layer; if factorization is not obtained,
1450
Jean-Pierre Nadal and Nestor Parga
one can add a second layer, and so on (note, however, that the couplings of a given layer have to be readapted each time a new layer is added). 8 Conclusion In this article, we have presented several new results on blind source separation. Focusing on the mathematical aspects of the problem, we obtained several necessary and sufficient conditions that, if fulfilled, guarantee that separation has been performed. These conditions are on a limited set of cross-cumulants and can be used for defining an appropriate cost function or just in order to check, when using any BSS algorithm, that a correct solution has been reached. Next, we showed how algebraic solutions can be easily understood, and for some of them generalized, within the framework of the reduction to the search for an orthogonal matrix. We then discussed adaptive approaches, the main focus being on cost functions based on information-theoretic criteria. In particular, we have shown that the resulting updating rule appears to be, in a loose sense, Hebbian and more precisely quite similar to the type proposed by Bienenstock, Cooper, and Munro in order to account for experimental data on the development of the early visual system (Bienenstock et al., 1982). We also showed how some cost functions could be conveniently used for nonlinear processing, that is, for, say, a multilayer network. In all cases, we paid attention to relate our work to other similar approaches. We showed how the reduction to the search for an orthogonal transformation is a convenient tool for analyzing the BSS problem and finding new solutions. This, of course, does not mean that one cannot perform BSS without whitening, and indeed there are interesting approaches to BSS in which whitening is not required (Laheld & Cardoso, 1994). Appendix A: Proof of Theorem 1 Let us consider a matrix J for which equation 3.1 is true. First we use the fact that J diagonalize the two-point correlation. Hence, with the notation and results of section 2, we have to determine the family of orthogonal matrices X such that, when h = XK 0 −1/2 σ
(A.1)
the k-order cumulants in equation 3.1 are zero. Using the above expression of h, we have for any k-order cumulant in equation 3.1, hi0 ic = hh(k−1) i
N X (Xi,a )(k−1) Xi0 ,a , ζka , a=1
(A.2)
Redundancy Reduction and Independent Component Analysis
1451
where the ζka are the normalized k-cumulants ζka =
hσak ic k/2
hσa2 ic
,
a = 1, . . . , N.
(A.3)
A.1 The Case of N Nonzero Source Cumulants. We first consider the case when for every a, ζka is not zero. Since we want the k-order cumulants to be zero whenever i 6= i0 , we can write 1i δi,i0 =
N X (Xi,a )(k−1) Xi0 ,a ζka
(A.4)
a=1
for some yet indeterminate constants 1i . Using the fact that X is orthogonal, T we multiply both sides of this equation by Xi0 ,a = Xa,i 0 for some a and sum over i0 . This gives 1i Xi,a = (Xi,a )(k−1) ζka .
(A.5)
There are now two possibilities for each pair (i, a): either Xi,a = 0, or Xi,a is nonzero (and then 1i as well), and we can write (k−2) Xi,a = εi,a
1i , ζka
(A.6)
where εi,a is 1 or 0. For k odd, one then has · Xi,a = εi,a
1i ζka
¸
1 k−2
.
(A.7)
We now use the fact that X is orthogonal; first, for each i, the sum over a of 2 is one, hence for at least one a ε is nonzero—and it follows also the Xi,a i,a P that for every i 1i is nonzero. Second, for every pair i 6= i0 , a Xi,a Xi0 ,a = 0. Then, from equation A.7, we have X a
2
εi,a εi0 ,a (ζka ) k−2 = 0.
(A.8)
The left-hand side is a sum of positive terms; hence each has to be zero. It follows that for every a, either εi,a or εi0 ,a is zero (or both). The argument can be repeated exchanging the roles of the indices i and a, so that it is also true that for each pair a 6= a0 , for each i either εi,a or εi,a0 is zero (or both). Hence X is a matrix with, on each row and on each column, a single nonzero element, which, necessarily, is then ±1: X is what we called a sign permutation, and this completes the proof of part (i) of the theorem for the case where the k-order cumulants are nonzero for every source.
1452
Jean-Pierre Nadal and Nestor Parga
Remark. For k even, there is a sign indetermination when going from P (k−2) Xi,a to Xi,a . Hence one cannot write that a Xi,a Xi0 ,a is a sum of positive numbers. In fact, taking k = 4 and N even, one can easily build an example where the conditions in equation 3.1 are fulfilled with at least one solution X that is not a sign permutation. For instance, let N = 4 and ζ4a = z for every a. Then the equations 3.1 are fulfilled for X defined by:
1 1 1 X= 1 2 1
1 1 −1 −1
1 −1 1 −1
−1 1 −1 1
(A.9)
A.2 The Case of L < N Nonzero Source Cumulants. Now we consider the case where only L < N k-order cumulants are nonzero (and this will include the case of only one source with zero cumulant). Without loss of generality, we will assume that they are the first L sources (a = 1, . . . , L). We have to reconsider the preceding argument. It started with equation A.4 in which the right-hand side can be considered as a sum over a = 1, . . . , L. It follows that a particular family of solutions is the set of block-diagonal matrices X, with an L × L block being a sign permutation matrix, followed by an (N − L) × (N − L) block being an arbitrary (N − L) × (N − L) orthogonal matrix. We show now that these are, up to a global permutation, the only solutions. Since the k-cumulants are zero for a > L, we have the following possibilities for each i: 1. 1i = 0 and Xi,a = 0, a = 1, . . . , L. 2. 1i 6= 0, Xi,a = 0, a > L, and for a = 1, . . . , L equation A.6 is valid with at least one εi,a nonzero. By applying an appropriate permutation, we can assume that it is the first ` indices, i = 1, . . . , `, for which 1i 6= 0. Hence X has a nonzero upperleft ` × L block, X1 , which satisfies all the equations derived previously (in particular, X1 X1T = 1N ), but with i = 1, . . . , ` and a = 1, . . . , L; and a nonzero lower-right (N−`)×(N−L) block, X2 , for which the only constraint is X2 X2T = X2T X2 = 1N . It follows from the discussion of the case with nonzero cumulants that X1 has a single nonzero element per line and per column, which implies that this is a square matrix: ` = N, and this completes the proof. In addition, if only one source has zero k-order cumulant, then the X2 matrix is just a 1 × 1 matrix, whose unique element is thus ±1. Hence, in that case, it is in fact all the N sources that are separated.
Redundancy Reduction and Independent Component Analysis
1453
Appendix B: Proof of Theorem 2 In the first part of the proof, we proceed exactly as in appendix A. We reduce the problem to the search of an orthogonal matrix X such that, when h = XK 0 −1/2 σ
(B.1)
the k-order cumulants in equation 3.2 are zero. Using the above expression of h, we have for any k-order cumulant in equation 3.2, hm−1 hi00 ic = hh(k−m) i i0
N X a (Xi,a )(k−m) Xim−1 0 ,a Xi00 ,a ζk
(B.2)
a=1
where the ζka are the normalized k-cumulants as in equation A.3. Now we write that these quantities are zero whenever at least two of the indices i, i0 , i00 are different: 1i δi,i0 δi0 ,i00 =
N X a (Xi,a )(k−m) Xim−1 0 ,a Xi00 ,a ζk
(B.3)
a=1
for some yet indeterminate constants 1i . Using the fact that X is orthogonal, we multiply both sides of this equation by Xi00 ,a for some a for which ζka is nonzero, and sum over i00 . This gives 1i Xi0 ,a δi,i0 = (Xi,a )(k−m) (Xi0 ,a )(m−1) ζka .
(B.4)
Now, either Xi0 ,a = 0, or Xi0 ,a 6= 0 and then 1i δi,i0 = (Xi,a )(k−m) (Xi0 ,a )(m−2) ζka .
(B.5)
At this point the proof differs from the one of Theorem 1 and is in fact simpler. For i 6= i0 the right-hand side of equation B.5 is 0. Because Xi0 ,a 6= 0 and k is strictly greater than m, we obtain Xi,a = 0. Hence, for every a such that ζka 6= 0, there is at most one i for which Xi,a 6= 0. Since X is orthogonal, this implies that its restriction to the subspace of nonzero ζka is a sign permutation. Acknowledgments This work was partly supported by the French-Spanish program Picasso, the E.U. grant CHRX-CT92-0063, the Universidad Autonoma ´ de Madrid, the Universit´e Paris VI, and Ecole Normale Sup´erieure. NP and JPN thank, respectively, the Laboratoire de Physique Statistique (ENS) and the Departamento de F´ısica Teorica ´ (UAM) for their hospitality. We thank B. Shraiman for communicating his results prior to publication and P. Del Giudice and A. Campa for useful discussions. We thank an anonymous referee for useful comments, which led us to elaborate on the joint diagonalization method.
1454
Jean-Pierre Nadal and Nestor Parga
References Abramowitz, M., & Stegun, I. A. (1972). Handbook of mathematical functions. New York: Dover. Aldazabal, G., Gonzalez-Arroyo, A., & Parga, N. (1985). The stochastic quantization of U(N) and SU(N) lattice gauge theory and Langevin equations for the Wilson loops. Journal of Physics, A18, 2975. Amari, S. I., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? NETWORK, 3, 213–251. Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193. Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (ed.), Sensory communication (p. 217). Cambridge, MA: MIT Press. Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Comp., 1, 412–423. Bar-Ness, Y. (1982, November). Bootstrapping adaptive interference cancelers: Some practical limitations. In The Globecom Conf. (pp. 1251–1255). Paper F3.7, Miami, Nov. 1982. Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Trans. I. T., 39. Bell, A., & Sejnowski, T. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159. Belouchrani, A., & M. (1993). S´eparation aveugle au second ordre de sources corr´el´ees. In GRETSI’93, Juan-Les-Pins, 1993. Bienenstock, E., Cooper, L., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal Neurosciences, 2, 32–48. Blahut, R. E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley. Burel, G. (1992). Blind separation of sources: A nonlinear neural algorithm. Neural Networks, 5, 937–947. Cardoso, J.-F. (1989) Source separation using higher-order moments. In Proc. Internat. Conf. Acoust. Speech Signal Process.-89 (pp. 2109–2112). Glasgow. Cardoso, J.-F., & Souloumiac, A. (1993). Blind beamforming for non Gaussian signals. IEE Proceedings-F, 140(6), 362–370. Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314. Cybenko, G. (1989). Approximations by superpositions of a sigmoidal function. Math. Contr. Signals Syst., 2, 303–314. Delfosse, N., & Loubaton, Ph. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Redundancy Reduction and Independent Component Analysis
1455
Del Giudice, P., Campa, A., Parga, N., & Nadal, J.-P. (1995). Maximization of mutual information in a linear noisy network: A detailed study. NETWORK, 6, 449–468. Deville, Y., & Andry, L. (1995). Application of blind source separation techniques to multi-tag contactless identification systems. Proceedings of NOLTA’95 (pp. 73-78). Las Vegas. F´ety, L. (1988). M´ethodes de traitement d’antenne adapt´ee aux radio-communications. Unpublished doctoral dissertation, ENST, Paris. Gaeta, M., & Lacoume, J. L. (1990). Source separation without apriori knowledge: The maximum likelihood approach. In L. Tores, E. MasGrau, & M. A. Lagunas (eds.), Signal Processing V, proceedings of EUSIPCO 90 (pp. 621-624). Haft, M., Schlang, M., & Deco, G. (1995). Information theory and local learning rules in a self-organizing network of ising spins. Phys. Rev. E, 52, 2860–2871. Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1160. Hopfield, J. J. (1991). Olfactory computation and object perception. Proc. Natl. Acad. Sci. USA, 88, 6462–6466. Hornik, K., Stinchcombe, M., & White, H. (1991). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366. Intrator, N., & Cooper, L. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5, 3–17. Jutten, C., & Herault, J. (1991). Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. Laheld, B., & Cardoso, J.-F. (1994). Adaptive source separation without prewhitening. In Proc. EUSIPCO (pp. 183–186). Edinburgh. Law and Cooper, L. (1994). Formation of receptive fields in realistic visual environments according to the BCM theory. Proc. Natl. Acad. Sci. USA, 91, 7797– 7801. Li, Z., & Atick, J. J. (1994). Efficient stereo coding in the multiscale representation. Network: Computation in Neural Systems, 5, 1–18. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105– 117. Molgedey, L., & Schuster, H. G. (1994). Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72, 3634–3637. Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. NETWORK, 5, 565–581. Parra, L. C. (1996). Symplectic nonlinear component analysis. Preprint. Pham, D.-T., Garrat, Ph. & Jutten, Ch. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In Proc. EUSIPCO (pp. 771–774). Redlich, A. N. (1993). Redundancy reduction as a strategy for unsupervised learning. Neural Comp., 5, 289–304. Rospars, J.-P., & Fort, J.-C. (1994). Coding of odor quality: Roles of convergence and inhibition. NETWORK, 5, 121–145. Scofield, C. L., & Cooper, L. (1985). Development and properties of neural networks. Contemporary Physics, 26, 125–145.
1456
Jean-Pierre Nadal and Nestor Parga
Shraiman, B. (1993). Technical memomorandum (AT&T Bell Labs TM-11111930811-36). Tong, L., Soo, V., Liu, R., & Huang, Y. (1990). Amuse: A new blind identification algorithm. In Proc. ISCAS. New Orleans. van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive fields of fly lmcs, and experimental validation. J. Comp. Physiol. A, 171, 157– 170.
Received June 26, 1996; accepted February 19, 1997.
Communicated by Erkki Oja
Adaptive Online Learning Algorithms for Blind Separation: Maximum Entropy and Minimum Mutual Information Howard Hua Yang Shun-ichi Amari Laboratory for Information Representation, FRP, RIKEN Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan
There are two major approaches for blind separation: maximum entropy (ME) and minimum mutual information (MMI). Both can be implemented by the stochastic gradient descent method for obtaining the demixing matrix. The MI is the contrast function for blind separation; the entropy is not. To justify the ME, the relation between ME and MMI is first elucidated by calculating the first derivative of the entropy and proving that the mean subtraction is necessary in applying the ME and at the solution points determined by the MI, the ME will not update the demixing matrix in the directions of increasing the cross-talking. Second, the natural gradient instead of the ordinary gradient is introduced to obtain efficient algorithms, because the parameter space is a Riemannian space consisting of matrices. The mutual information is calculated by applying the Gram-Charlier expansion to approximate probability density functions of the outputs. Finally, we propose an efficient learning algorithm that incorporates with an adaptive method of estimating the unknown cumulants. It is shown by computer simulation that the convergence of the stochastic descent algorithms is improved by using the natural gradient and the adaptively estimated cumulants. 1 Introduction Let us consider the case where a number of observed signals are mixtures of the same number of stochastically independent source signals. It is desirable to recover the original source signals from the observed mixtures. The blind separation problem is to find a transform that recovers original signals without knowing how the sources are mixed. A learning algorithm for blind source separation belongs to the category of unsupervised factorial learning (Deco & Brauer, 1996). It is related to the redundancy reduction principle (Barlow & Foldi´ ¨ ak, 1989; Nadal & Parga, 1994), one of the principles possibly employed by the nervous system to extract the statistically independent features for a better representation of the input without losing information. When the mixing model is arbitrarily nonlinear, it is generally Neural Computation 9, 1457–1482 (1997)
c 1997 Massachusetts Institute of Technology °
1458
Howard Hua Yang and Shun-ichi Amari
impossible to separate the sources from the mixture unless further knowledge about the sources and the mixing model is assumed. In this article, we discuss only the linear mixing model for which the inverse transform (the demixing transform) is also linear. Maximum entropy (ME) (Bell & Sejnowski, 1995a) and minimum mutual information (MMI) or independent component analysis (ICA) (Amari, Cichocki, & Yang, 1996; Comon, 1994) are two major approaches to derive blind separation algorithms. By the MMI, the mutual information (MI) of the outputs is minimized to find the demixing matrix. This approach is based on established (ICA) theory (Comon, 1994). The MI is one of the best contrast functions since it is invariant under transforms such as scaling, permutation, and componentwise nonlinear transforms. Since scaling and permutation are indeterminacy in the blind separation problem, it is desirable to choose a contrast function that is invariant to these transforms. All the global minima (zero points) of the MI are possible solutions to the blind separation problem. By the ME approach, the outputs of the demixing system are first componentwise transformed by sigmoid functions, and then the joint entropy of the transformed outputs is maximized to find a demixing matrix. The online algorithm derived by the ME is concise and often effective in practice, but the ME has not yet been rigorously justified except for the case when the sigmoid functions happen to be the cumulative density functions of the unknown sources (Bell & Sejnowski, 1995b). We study the relation between the ME and MMI, and give a justification to the ME approach. In order to realize the MMI by online algorithms, we need to estimate the MI. To this end, we apply the Gram-Charlier expansion and the Edgeworth expansion to approximate the probability density functions of the outputs. We then have the stochastic gradient type online algorithm similar to the one obtained from the ME. In order to speed up the algorithm, we show that the natural gradient descent method should be used to minimize the estimated MI rather than the conventional gradient. This is because the stochastic gradient descent takes place in the space of matrices. This space has a natural Riemannian structure from which the natural gradient is obtained. We also apply this idea to improve the online algorithms based on the ME. Although the background philosophies of the ME and MMI are different, both approaches result in algorithms of a similar form with different nonlinearity. Both include unknown factors to be fixed adequately or to be estimated. They are activation functions in the case of ME and cumulants κ3 and κ4 in the case of MMI. Their optimal values are determined by the unknown probability distributions of the sources. The point is that the algorithms work even if the specification of the activation functions or cumulants is not accurate. However, efficiency of the algorithms becomes worse by misspecification. In order to obtain efficient algorithms, we propose an adaptive method
Adaptive Online Learning Algorithms
1459
together with online learning algorithms in which the unknown factors are adequately estimated. The performances of the adaptive online algorithms are compared with the fixed online algorithms based on simulation results. It is shown by simulations that the adaptive algorithms work much better in various examples. The blind separation problem is described in section 2. The relation between the ME and the MMI is discussed in section 3. The gradient descent algorithms based on ME and MMI are derived in section 4. The natural gradient is also shown, but its derivation is given in appendix B. The GramCharlier expansion and the Edgeworth expansion are applied in this section to estimate the MI. The adaptive online algorithms based on the MMI, together with an adaptive method of estimating the unknown cumulants, are proposed in section 5. The performances of the adaptive algorithms are compared with the fixed ones by the simulations in section 6. Finally, the conclusions are made in section 7. 2 Problem Let us consider n unknown source signals Si (t), i = 1, . . . , n, which are mutually independent at any fixed time t. We denote random variables by uppercase letters and their specific values by the same letters in lowercase. The bold uppercase letters denote random vectors or matrices. We assume that the sources Si (t) are stationary processes, each source has moments of any order, and at most one source is gaussian. We also treat visual signals where t should be replaced by the spatial coordinates (x, y) with Si (x, y) representing the brightness of the pixel at (x, y). The model for the sensor outputs is
X (t) = AS (t), where A ∈ Rn×n is an unknown nonsingular mixing matrix, S (t) = [S1 (t), . . . , Sn (t)]T and X (t) = [X1 (t), . . . , Xn (t)]T and T denote the transposition. Without knowing the source signals and the mixing matrix, we want to recover the original signals from the observations X (t) by the following linear transform,
Y (t) = W X (t), where Y (t) = [Y1 (t), . . . , Yn (t)]T and W ∈ Rn×n is a matrix. When W is equal to A−1 we have Y (t) = S (t). It is impossible to obtain the original sources Si (t) in exact order and amplitude because of the indeterminacy of permutation of {Si } and scaling due to the product of two unknowns: the mixing matrix A and the source vector S (t). Nevertheless, subject to a permutation of indices, it is possible to obtain the scaled sources ci Si (t) where the constants ci are nonzero scalar factors. The source signals are identifiable in this sense. Our goal is to find a demixing
1460
Howard Hua Yang and Shun-ichi Amari
matrix W adaptively so that [Y1 , . . . , Yn ] coincides with a permutation of scaled [S1 , . . . , Sn ]. In this case, this demixing matrix W is written as
W = ΛP A−1 , where Λ is a nonsingular diagonal matrix and P a permutation matrix. 3 Maximizing Entropy versus Minimizing Mutual Information There are two well-known methods for blind separation: (1) maximizing the entropy (ME) of the transformed outputs and (2) minimizing the mutual information (MMI) of Y so that its components become independent. We discuss the relation between ME and MMI. 3.1 Some Properties of Entropy and Mutual Information.PThe idea of ME originated from neural networks. Let us transform ya = j waj xj by a sigmoid function ga (y) to za = ga (ya ), which is regarded as the output from an analog neuron. Let Z = (g1 (Y1 ), . . . , gn (Yn )) be the componentwise transformed output vector by sigmoid functions ga (y), a = 1, . . . , n. It is expected that the entropy of the output Z is maximized when the components Za of Z are mutually independent. The blind separation algorithm based on ME (Bell & Sejnowski, 1995a) was derived by maximizing the joint entropy H(Z ; W ) with respect to W by using the stochastic gradient descent method. The joint entropy of Z is defined by Z H(Z ; W ) = − p(z ; W ) log p(z ; W )dz , where p(z ; W ) is the joint probability density function (pdf) of Z determined by W and {ga }. The nonlinear transformations ga (y) are necessary for bounding the entropy in a finite range. Indeed, when g(y) is bounded c ≤ g(y) ≤ d, for any random variable Y the entropy of Z = g(Y) has an upper bound: H(Z) ≤ log(d − c). Therefore, the entropy of the transformed output vector is upper bounded: H(Z ; W ) ≤
n X
H(Za ) ≤ n log(d − c).
(3.1)
a=1
In fact, the above inequality holds for any bounded transforms, so the global maximum of the entropy H(Z ; W ) exists. H(Z ; W ) may also have many local maxima determined by the functions {ga } used to transform Y . By studying the relation between ME and MMI, we shall prove that some of these maxima are the demixing matrices with the form ΛP A−1 .
Adaptive Online Learning Algorithms
1461
The basic idea of MMI is to choose W that minimizes the dependence among the components of Y . The dependence is measured by the KullbackLeibler divergence I(W ) between the joint pdf p(y ; W ) of Y and its factor˜ y ; W ), which is the product of the marginal pdf’s of Y : ized version p( Z ˜ y ; W )] = I(W ) = D[p(y ; W ) k p( ˜ y; W ) = where p( pa (ya ; W ) =
Qn Z
a=1 pa (ya ; W )
p(y ; W ) log
p(y ; W ) dy ˜ y, W ) p(
(3.2)
and pa (ya ; W ) is the marginal pdf of y ,
ˇ , . . . , dyn , p(y ; W )dy1 , . . . , dy a
ˇ denoting that dya is missing from dy = dy1 , . . . , dyn . dy a The Kullback-Leibler divergence (see equation 3.2) gives the MI of Y and is written in terms of entropies, I(W ) = −H(Y; W ) +
n X
H(Ya ; W ),
(3.3)
a=1
where
R H(Y; W ) = − p(y ; W ) log p(y ; W )dy ,
and
R H(Ya ; W ) = − pa (ya ; W ) log pa (ya ; W )dya is the marginal entropy.
It is easy to show that I(W ) ≥ 0 and is equal to zero if and only if Ya are independent. As it is proved in Comon (1994), I(W ) is a contrast function for the ICA, meaning, I(W ) = 0 iff W = ΛP A−1 . In contrast to the entropy criterion, the mutual information I(W ) is invariant under componentwise monotonic transformations, scaling, and permutation of Y . Let {gi (y)} be differentiable monotonic functions and Z = (g1 (Y1 ), . . . , gn (Yn )) be the transformed outputs. It is easy to prove that the mutual information I(W ) is the same for Y and Z . This implies that the componentwise nonlinear transformation is not necessary as a preprocessing if we use the MI as a criterion. 3.2 Relation Between ME and MMI. The blind separation algorithm derived by the ME approach is concise and often very effective in practice. However, the ME is not rigorously justified except when the sigmoid transforms happen to be the cumulative distribution functions (cdf’s) of the unknown sources.
1462
Howard Hua Yang and Shun-ichi Amari
It is discussed in Nadal and Parga (1994) and Bell and Sejnowski (1995a) that the ME does not necessarily lead to a statistically independent representation. To justify the ME approach, we study how ME is related to MMI. We prove that when all sources have a zero mean, ME gives locally correct solutions; otherwise it is not true in general. Let us first consider the relation between the entropy and the mutual information. Since Z is a componentwise nonlinear transform of Y , it is easy to obtain H(Z ; W ) = H(Y ; W ) +
n Z X
dya p(ya ; W ) log g0a (ya )
a=1
= −I(W ) +
n X
H(Ya , W ) +
a=1
= −I(W ) −
n X
n Z X
dya p(ya ; W ) log g0a (ya )
a=1
D[p(ya ; W ) k g0 (ya )].
a=1
Hence, we have the following equation: ˜ y ; W ) k g0 (y )] −H(Z ; W ) = I(W ) + D[p(
(3.4)
where g 0 (y ) =
n Y
g0a (ya )
a=1
is regarded as a pdf of an independent random vector provided, for each a, ga (−∞) = 0, ga (∞) = 1, and g0a (y) is the derivative of ga (y). Let {ra (sa )} be the pdf’s of the independent sources Sa (t) at any t. The joint Q pdf of the sources is r(s) = na=1 ra (sa ). When W = A−1 , we have Y = S so that p(ya ; A−1 ) = ra (ya ). Hence, from equation 3.4, it is straightforward that if {ga (ya )} are the cdf’s of the sources, r(y ) = g0 (y ) so that D[r(y ) k g0 (y )] = 0. In this case, H(Z ; A−1 ) = −I(A−1 ) = 0. Since H(Z ; W ) ≤ 0 from equation 3.1, the entropy H(Z ; W ) achieves the global maximum at W = A−1 . Let r˜(y ) be the joint pdf of the scaled and permutated sources y = ΛP s. It is easy to prove that H(Z ; W ) achieves the global maximum at W = ΛP A−1 too. This was also discussed in Nadal and Parga (1994) and Bell and Sejnowski (1995b), from different perspectives. The above formula (see equation 3.4) can also be derived from the formula 24 in Nadal and Parga (1994). We now study the general case where g0 (y ) is different from r(y ). We decompose H(Z ; W ) as −H(Z ; W ) = I(W ) + D(W ) + C(W ),
(3.5)
Adaptive Online Learning Algorithms
1463
where I(W ) = I(y ; W ), ˜ y , W ) k r(y )], D(W ) = D[p( n Z n Z X X C(W ) = dya p(ya ; W ) log ka (ya ) = dy p(y ; W ) log ka (ya ), a=1
a=1
ra (ya ) ka (ya ) = 0 . ga (ya ) Since p(ya , A−1 ) = ra (ya ), I(W ) and D(W ) take the minimum value zero at W = A−1 . To understand the behavior of C(W ), we need to compute its gradient. To facilitate the process for calculating the gradient, we reparameterize W as
W = (I + B )A−1 around W = A−1 so that Y = (I + B )S . We call the diagonal elements {Bbb } the scaling coordinates and the off-diagonal elements {Bbc , b 6= c} the cross-talking coordinates. If ∂C/∂Bbc 6= 0 for some b 6= c at W = A−1 , then the gradient descent algorithm based on the ME would increase crosstalking around this point. If ∂C/∂Bbc = 0 for all b 6= c, even if ∂C/∂Bbb 6= 0 for some b, the ME method still gives a correct solution for any nonlinear functions ga (y). We have the following lemma: Lemma 1.
At B = 0 or W = A−1 , for b 6= c, the gradient of C(W ) is
∂C =− ∂Bbc
µZ
¶ µZ dyc yc r(yc )
¶ dyb r0b (yb )rb (yb ) log kb (yb ) .
(3.6)
The proof is given in appendix A. The diagonal terms ∂C/∂Bbb do not vanish in general. It is easy to see from equation 3.6 that at W = A−1 the derivatives {∂C/∂Bbc , b 6= c} vanish when all the sources have a zero mean, Z dsa sa r(sa ) = 0, or all the sigmoid functions ga are equal to the cdf’s of the sources, that is, g0 (y ) = r(y ). Similarly, reparameterizing the function C(W ) around around W = ΛP A−1 using
W = (I + B )ΛP A−1 , we calculate the gradient ∂C/∂ B at B = 0. It turns out that equation 3.6 still holds after permutating indexes, and the derivatives {∂C/∂Bbc , b 6= c}
1464
Howard Hua Yang and Shun-ichi Amari
vanish when all sources have a zero mean or g0 (y ) = r˜(y ), which is the joint pdf of scaled and permutated sources. Hence, we have the following theorem regarding the relation between the ME and the MMI: Theorem 1. When all sources are zero mean signals, at all solution points W = ΛP A−1 determined by the contrast function MI, the ME algorithms will not update the demixing matrix in the directions of increasing the cross-talking. When one of sources has a nonzero mean, the entropy is not maximized in general at W = ΛP A−1 . Note the entropy is maximized at W = ΛP A−1 when the sigmoid functions are equal to the cdf’s of the scaled and permutated sources. But usually we cannot choose the unknown cdf’s as the sigmoid functions. In some blind separation problems such as the separation of visual signals, the means of mixture are positive. In such cases, we need a preprocessing step to centralize the mixture:
xt − xt = A(st − st ),
(3.7)
P where xt = 1t tu=1 xu and st is similarly defined. The mean subtraction can also be implemented by online thresholds for the sigmoid functions applied to the outputs. Applying a blind separation algorithm, we find a matrix W close to one of solution points ΛP A−1 such that W (xt − xt ) ≈ ΛP (st − st ), and then obtain the sources by
W xt = W (xt − xt ) + W xt ≈ ΛP st . Since every linear mixture model can be reformulated as equation 3.7, without losing generality, we assume that all sources have a zero mean in the rest of this article. 4 Gradient Descent Algorithms Based on ME and MMI The gradient descent algorithms based on ME and MMI are the following: dW ∂H(Z ; W ) =η , dt ∂W
(4.1)
∂I(W ) dW = −η . dt ∂W
(4.2)
However, the algorithms work well if we put a positive-definite operator G
Adaptive Online Learning Algorithms
1465
on matrices: ∂H(Z ; W ) dW = ηG ? dt ∂W
(4.3)
∂I(W ) dW = −ηG ? . dt ∂W
(4.4)
The gradients themselves are not obtainable in general. In practice, they are replaced by the stochastic ones whose expectations give the true gradients. This method, called the stochastic gradient descent method, has been known in the neural network community for a long time (e.g., Amari, 1967). Some remarks on the the stochastic gradient and backpropagation are given in Amari (1993). 4.1 Stochastic Gradient Descent Method Based on ME. The entropy H(Z ; W ) can be written as H(Z ; W ) = H(X ) + log |W | +
n X
E[log g0a (Ya )],
a=1
where |W | = | det(W )|. It is easy to show ∂ log |W | = W −T ∂W where W −T = (W −1 )T . Moreover, P ∂ a=1 log g0a (ya ) = −Φ(y )xT , ∂W where ¶T µ 00 g (y1 ) g00 (yn ) · · · − n0 Φ(y ) = − 10 . g1 (y1 ) gn (yn )
(4.5)
Hence, the expectation of the instantaneous values of the stochastic gradient c ∂H = W −T − Φ(y )xT ∂W
(4.6)
gives the gradient ∂H/∂ W . Based on this, the following algorithm proposed by Bell & Sejnowski (1995a), is obtained from equations 4.3 and 4.6: dW = η(W −T − Φ(y )xT ). dt
(4.7)
Here, the learning equation is in the form of equation 4.3 with the operator G equal to the identity matrix. We show a more adequate G later. When the
1466
Howard Hua Yang and Shun-ichi Amari
Rx transforms are tanh(x) and −∞ exp(− 14 u4 )du, the corresponding activation functions ga are 2tanh(x) and x3 , respectively. We show in appendix B that the natural choice of the operator G should be
G?
∂H ∂H W TW . = ∂W ∂W
This is given by the Riemannian structure of the parameter space of matrices W (Amari, 1997). We then obtain the following algorithm from equation 4.7, dW = η(I − Φ(y )y T )W , dt
(A)
which is not only computationally easy but also very efficient. The learning rule of form (A) was proposed in Cichocki, Unbehaven, Moszczynski, ´ and Rummert (1994) and later justified in Amari, Cichocki, & Yang, (1996). The learning rule (A) has two important properties: the equivariant property (Cardoso & Laheld, 1996) and the property of keeping W (t) from becoming singular, whereas the learning rule (see equation 4.7) does not have these properties. To prove the second property, we define hX , Y i = tr(X T Y ) and calculate ¿ À D E ∂|W | dW d|W | = = |W |W −T , η(I − 8(y )y T )W , dt ∂W dt à ! n X T (1 − φi (yi )yi ) |W |. = ηtr(I − 8(y )y )|W | = η i=1
Then, we obtain an expression for |W (t)| ( Z ) n tX (1 − φi (yi (τ ))yi (τ ))dτ , |W (t)| = |W (0)| exp η
(4.8)
0 i=1
from which we know that |W (t)| 6= 0 if |W (0)| 6= 0. This means that {X ∈ Rn×n : |X | 6= 0} is an invariant set of the flow W (t) driven by the learning rule (A). It is worth pointing out that algorithm 4.7 is equivalent to the sequential maximum likelihood algorithm when ga are taken to be equal to ra . For r(s) = r1 (s1 ) . . . rn (sn ), the likelihood function is log p(x; W ) = log(r(W x)|W |) =
n X a=1
log ra (ya ) + log |W |,
Adaptive Online Learning Algorithms
1467
from which we have ∂ log p(x; W ) r0 (ya ) xk + (W −T )ak , = a ∂wak ra (ya ) and in the matrix form, ∂ log p(x; W ) = W −T − 9(y )xT , ∂W where
¶T µ 0 r (y1 ) r0 (yn ) ,...,− n 9(y ) = − 1 r1 (y1 ) rn (yn )
and wak is the (a, k) elements of W . Using the natural (Riemannian) gradient descent method to maximize log p(x; W ), we have the sequential maximum likelihood algorithm of the type in equation (A): dW = µ(I − 9(y )y T )W . dt
(4.9)
From the point of view of asymptotic statistics, this gives the Fisher-efficient estimator. However, we do not know ra or 9(y ), so we need to use the adaptive method of estimating 9(y ) in order to implement equation 4.9. 4.2 Approximation of MI. To implement the MMI, we need some formulas to approximate the MI. Generally it is difficult to obtain the explicit form of the MI since the pdf’s of the outputs are unknown. The main difficulty is calculating the marginal entropy H(Ya ; W ) explicitly. In order to estimate the marginal entropy, the Edgeworth expansion and the GramCharlier expansion were used in Comon (1994) and Amari et al. (1996), respectively, to approximate the pdf’s of the outputs. The two expansions are formally identical except for the order of summation, but the truncated expansions are different. We show later by computer simulation that the Gram-Charlier expansion is superior to the Edgeworth expansion for blind separation. In order to calculate each H(Ya ; W ) in equation 3.3, we shall apply the Gram-Charlier expansion to approximate the pdf pa (ya ). Since we assume E[s] = 0, we have E[y] = E[W As] = 0 and E[ya ] = 0. To simplify the calculations for the entropy H(ya ; W ), to be carried out later, we assume ma2 = E[y2a ] = 1 for all a. Under this assumption, we approximate each marginal entropy first and then obtain a formula for the MI. The zero mean and unit variance assumption makes it easier to calculate the MI. However, the formula obtained for the MI can be used in more general cases. When the components of Y have nonzero means and different variances, we can shift and scale Y to obtain
Y˜ = Σ−1 (Y − m),
1468
Howard Hua Yang and Shun-ichi Amari
where Σ = diag(σ1 , . . . , σn ), σa is the variance of Ya , and m = E[Y ]. The components of Y˜ have the zero mean and the unit variance. Due to the invariant property of the MI, the formula for computing the MI of Y˜ is applicable to compute the MI of Y . We use the following truncated Gram-Charlier expansion (Stuart & Ord, 1994) to approximate the pdf pa (ya ): ½ ¾ κa κa pa (ya ) ≈ α(ya ) 1 + 3 H3 (ya ) + 4 H4 (ya ) , 3! 4!
(4.10)
where κ3a = ma3 and κ4a = ma4 − 3 are the third- and fourth-order cumulants of Ya , respectively, and mak = E[yka ] is the kth order moment of Ya , α(y) = √1 e− 2π
y2 2
, and Hk (y), k = 1, 2, . . . , are the Chebyshev-Hermite polynomials defined by the identity (−1)k
dk α(y) = Hk (y)α(y). dyk
The Gram-Charlier expansion clearly shows how κ3a and κ4a affect the approximation of the pdf. The last two terms in equation 4.10 characterize the deviations from the gaussian distributions. To apply equation 4.10 to calculate H(Ya ), we need the following integrals: Z α(y)(H3 (y))2 H4 (y)dy = 3!3
(4.11)
α(y)(H4 (y))3 dy = 123 .
(4.12)
Z
These integrals can be obtained easily from the following results for the moments of a gaussian random variable N(0, 1): Z
Z y2k+1 α(y)dy = 0,
y2k α(y)dy = 1 · 3 · · · (2k − 1).
(4.13)
By using the expansion log(1 + y) ≈ y −
y2 + O(y3 ) 2
and taking account of the orthogonality relations of the Chebyshev-Hermite polynomials and equations 4.11 and 4.12, the entropy H(Ya ; W ) is approximated by H(Ya ; W ) ≈
(κ a )2 (κ a )2 3 1 1 log(2πe) − 3 − 4 + (κ3a )2 κ4a + (κ4a )3 . (4.14) 2 2 · 3! 2 · 4! 8 16
Adaptive Online Learning Algorithms
1469
Let F(κ3a , κ4a ) denote the right-hand side of the approximation in equation 4.14. From Y = W X , we have H(Y ) = H(X ) + log |W |. Applying this expression and equation 4.14 to equation 3.3, we have I(Y ; W) ≈ −H(X ) − log |W | +
n X n log(2π e) + F(κ3a , κ4a ). 2 a=1
(4.15)
On the other hand, the following approximation of the marginal entropy (Comon, 1994) is obtained by using the Edgeworth expansion of pa (ya ): H(Ya ; W ) ≈
1 1 1 log(2πe) − (κ3a )2 − (κ a )2 2 2 · 3! 2 · 4! 4 −
7 a 4 1 a 2 a (κ ) + (κ3 ) κ4 . 48 3 8
(4.16)
The terms in the Edgeworth expansion are arranged in a decreasing order by assuming that κia is of order n(2−i)/2 . This is true when Ya is a sum of n independent random variables and the number n is large. To obtain the formula in equation 4.16, the truncated Edgeworth expansion is used by neglecting those high-order terms higher than 1/n2 . This is a standard method of asymptotic statistics but is not valid in the present context because of the fixed n. Moreover, at around W = A−1 , Ya is close to Sa , which is not a sum of independent random variables, so that the Edgeworth expansion is not justified in this case. The learning algorithm derived from equation 4.16 does not work well in our simulations. The reason is that the cubic of the fourth-order cumulant (κ4a )3 plays an important role but is omitted in the equation. It is a small term from the point of view of the Edgeworth expansion but not a small term in the present context. In our simulations, we deal with nearly symmetric source signals, so (κ3a )4 is small. In this case, we should use the following entropy approximation instead of equation 4.16: H(Ya ; W ) ≈
1 1 1 log(2πe) − (κ3a )2 − (κ a )2 2 2 · 3! 2 · 4! 4 1 1 + (κ3a )2 κ4a + (κ4a )3 . 8 48
(4.17)
The learning algorithm derived from the above formula works almost as well as equation 4.14. Note that the expansion formula can be more general. Instead of the gaussian kernel, we can use another standard distribution function as the kernel to expand pa (ya ) as: ( ) X µi Ki (y) , pa (y) = β(y) 1 + i
1470
Howard Hua Yang and Shun-ichi Amari
where Ki (y) are the orthogonal polynomials with respect to the kernel β(y). For example, the expansion corresponding to the kernel ( e−y , y ≥ 0 β(y) = 0, y>0 will be better than the Gram-Charlier expansion in approximating the pdf of a positive random variable. The orthogonal polynomials corresponding to this kernel are Laguerre polynomials. 4.3 Stochastic Gradient Method Based on MMI. To obtain the stochastic gradient descent algorithm to update W recursively, we need to calculate the gradient of I(W ) with respect to W . Since the exact function form of I(W ) is unknown, we calculate the derivative using the approximated MI. Since ∂κ3a = 3E[y2a xk ] ∂wak and ∂κ4a = 4E[y3a xk ], ∂wak we obtain the following from equation 4.15: ∂I(W ) ≈ −(W −T )ak + f (κ3a , κ4a )E[y2a xk ] + g(κ3a , κ4a )E[y3a xk ], ∂wak
(4.18)
where 9 1 f (y, z) = − y + yz, 2 4
1 3 3 g(y, z) = − z + y2 + z2 . 6 2 4
(4.19)
Removing the expectation symbol E(·) in equation 4.18 and writing it in a matrix form, we obtain the stochastic gradient: b ∂I ∂W
= −W −T + (f (κ3 , κ4 ) ◦ y 2 )xT + (g (κ3 , κ4 ) ◦ y 3 )xT ,
(4.20)
whose expectation gives ∂I/∂ W , where ◦ denotes the Hadamard product of two vectors
f ◦ y = ( f1 y1 , . . . , fn yn )T , and y k = [(y1 )k , . . . , (yn )k ]T for k = 2, 3, f (κ3 , κ4 ) = [ f (κ31 , κ41 ), . . . , f (κ3n , κ4n )]T , g (κ3 , κ4 ) = [g(κ31 , κ41 ), . . . , g(κ3n , κ4n )]T .
Adaptive Online Learning Algorithms
1471
We need to evaluate κ3 and κ4 for implementing this idea. If κ3 and κ4 are known, using the natural gradient descent method to minimize I(W ), from equation 4.20 we obtain the algorithm based on MMI: dW = η(t){I − Φκ (y )y T }W , dt
(A0 )
where
Φκ (y ) = h(y ; κ3 , κ4 ) = f (κ3 , κ4 ) ◦ y 2 + g (κ3 , κ4 ) ◦ y 3 .
(4.21)
Note that each component of Φκ (y ) is a third-order polynomial. In partic11 3 y . ular, if κ3i = 0 and κ4i = −1, the nonlinearity Φκ (y ) becomes 12 0 The algorithm (A ) is based on the entropy formula (see equation 4.14) derived from the Gram-Charlier expansion. Using the entropy formula (see equation 4.17) based on the Edgeworth expansion rather than the formula in equation 4.14, we obtain an algorithm, to be referred to as (A00 ). This algorithm has the same form as the algorithm (A0 ) except for the definition of the functions f and g. For the Edgeworth expansion-based algorithm (A00 ), instead of equation 4.19 we use the following definition for f and g: 7 3 1 f (y, z) = − y + y3 + yz, 2 4 4
1 1 g(y, z) = − z + y2 . 6 2
(4.22)
So the algorithm of type (A) is the common form for both the ME algorithm and the MMI algorithm. In the ME approach, the function Φ(y ) in type (A) is determined by the sigmoid transforms {ga (ya )}. If we use the cdf’s of the source distributions, we need to evaluate the unknown ra (sa ). In the MMI approach, the function Φκ (y ) = h(y ; κ3 , κ4 ) depends on the cumulants κ3a and κ4a , and the approximation procedures should be taken for computing the entropy. It is better to choose g0a equal to unknown ra or to choose κ3a and κ4a equal to the true values. However, it is verified by numerous simulations that even if the g0a or κ3a and κ4a are misspecified, the algorithms (A) and (A0 ) may still converge to the true separation matrix in many cases. 5 Adaptive Online Learning Algorithms In order to implement an efficient learning algorithm, we propose the following adaptive algorithm to trace κ3 and κ4 together with (A0 ): dκ3a = −µ(t)(κ3a − (ya )3 ), dt dκ4a = −µ(t)(κ4a − (ya )4 + 3), a = 1, . . . , n, dt where µ(t) is a learning rate function.
(5.1)
1472
Howard Hua Yang and Shun-ichi Amari
It is possible to use the same adaptive scheme for implementing the efficient ME algorithm. In this case, we use the Gram-Charlier expansion (see equation 4.10) to evaluate ra (ya ) or 9(y ). The performances of the algorithms (A0 ) with equation 5.1 will be compared with that of the algorithms (A) without adaptation in the next section by simulation. The simulation results demonstrate that the adaptive algorithm (A0 ) needs fewer samples to reach the separation than the fixed algorithm (A). 6 Simulation In order to show the effect of the online learning algorithms (A0 ) with equation 5.1 and compare its performance with that of the fixed algorithm (A), we apply these algorithms to separate sources from the mixture of modulating signals or the mixture of visual signals. For the algorithm A, we can use various fixed Φ(y ). For example, we can use one of the following functions 1. f (y) = y3 . 2. f (y) = 2 tanh(y). 3. f (y) = 34 y11 +
15 9 4 y
−
14 7 3 y
−
29 5 4 y
+
29 3 4 y
to define Φ(y ) = ( f (y1 ), . . . , f (yn ))T . Function 3 was used in Amari et al. (1996) by replacing κ3a and κ4a by their instantaneous values: κ3a ∼ (ya )3 ,
κ4a ∼ (ya )4 − 3.
We also compare the adaptive algorithm obtained from the Gram-Charlier expansion (A0 ) with the adaptive algorithm obtained from the Edgeworth expansion (A00 ). 6.1 Modulating Source Signals. Assume that the following five unknown sources are mixed by a mixing matrix A randomly chosen:
s(t) = [sign(cos(2π155t)), sin(2π 800t), sin(2π 300t + 6 cos(2π 60t)), sin(2π 90t), n(t)]T ,
(6.1)
where four components of s(t) are modulating data signals, and one component n(t) is a noise source uniformly distributed in [−1, +1]. The elements of the mixing matrix A are randomly chosen subject to the uniform distribution in [−1, +1] such that A is nonsingular. We use the cross-talking error defined below to measure the performance of the algorithms: ! Ã n n n n X X X X |pij | |pij | − 1 + −1 E= maxk |pik | maxk |pkj | i=1 i=1 j=1 j=1
Adaptive Online Learning Algorithms
1473
where P = (pij ) = WA. The mixed signals are sampled at the sampling rate of 10K Hz. Taking 2000 samples, we simulate the algorithm (A) with fixed Φ defined by the functions 1 and 2, and the algorithms (A0 ) and (A00 ) with the adaptive Φκ defined by equation 4.21 for which the functions f and g are defined by equations 4.19 and 4.22, respectively. The learning rate is a constant. For each algorithm, we select a nearly optimal learning rate. The initial matrix is chosen as W (0) = 0.5I in all simulations. The sources, mixtures, and outputs obtained by using the different algorithms are displayed in Figure 1. In each part, four out of five components are shifted upward from the zero level for a better illustration. The sources, mixtures, and all the outputs shown are within the same time window [0.1, 0.2]. It is shown in Figure 1 that the algorithms with adaptive Φκ need fewer samples than those with fixed Φ to achieve separation. The cross-talking errors are plotted in Figure 2, which shows that the adaptive MMI algorithms (A0 ) and (A00 ) outperform the ME algorithm (A) with either fixed function 1 or 2. The cross-talking error for each algorithm can be decreased further if more observations are taken and an exponentially decreasing learning rate is chosen. 6.2 Image Data. In this article, the two basic algorithms (A) and (A0 ) are derived by ME and MMI. The assumption that sources are independent is needed as a sufficient condition for identifiability. To check whether the performance of these algorithms is sensitive to the independence assumption, we test these algorithms on correlated data such as images. Images are usually correlated and nonstationary in the spatial coordinate. In the top row in Figure 3, we have six image sources {si (x, y), i = 1, . . . , 6} consisting of five natural images and one uniformly distributed noise. All images have the same size with N pixels in each of them. From each image, a pixel sequence is taken by scanning the image in a certain order, for example, row by row or column by column. These sources have the following correlation coefficient matrix:
Rs = (cij ) 1.0000 0.0819 −0.1027 −0.0616 0.2154 0.0106 0.0819 1.0000 0.3158 0.0899 −0.0536 −0.0041 −0.1027 0.3158 1.0000 0.3194 −0.4410 −0.0102 = 0.0899 0.3194 1.0000 −0.0692 −0.0058 −0.0616 0.2154 −0.0536 −0.4410 −0.0692 1.0000 0.0215 0.0106 −0.0041 −0.0102 −0.0058 0.0215 1.0000 , where the correlation coefficients cij are defined by P x,y (si (x, y) − si )(sj (x, y) − sj ) cij = Nσi σj
1474
0.1
Howard Hua Yang and Shun-ichi Amari
0.12
0.14
0.16
0.18
0.2
0.1
0.12
(a)
0.1
0.12
0.14
0.12
0.16
0.14
0.18
0.2
0.18
0.2
0.1
0.12
0.14
0.16
0.18
0.2
0.16
0.18
0.2
(d)
0.16
(e)
0.16
(b)
(c)
0.1
0.14
0.18
0.2
0.1
0.12
0.14
(f)
Figure 1: The comparison of the separation by the algorithms (A), (A0 ), and (A00 ). (a) The sources. (b) The mixtures. (c) The separation by (A0 ) using learning rate µ = 60. (d) The separation by (A00 ) using learning rate µ = 60. (e) The separation by (A) using x3 and µ = 65. (f) The separation by (A) using 2tanh(x) and µ = 30.
Adaptive Online Learning Algorithms
1475
30 (A) with 2tanh(x) 25
(A) with x^3 (A’’) (A’)
error
20
15
10
5
0 0
200
400
600
800
1000 1200 iteration
1400
1600
1800
2000
Figure 2: Comparison of the performances of (A) (using x3 or 2tanh(x)), and (A0 ) and (A00 ).
si =
1 X si (x, y). N x,y
σi2 =
1 X (si (x, y) − si )2 . N x,y
It is shown by the above source correlation matrix that each natural image is correlated with one or more other natural images. Mixing the image sources by the following matrix, 1.0439 1.0943 0.9927 1.0955 1.0373 0.9722 1.0656 0.9655 0.9406 1.0478 1.0349 1.0965 1.0913 0.9555 0.9498 0.9268 0.9377 0.9059 A= 1.0232 1.0143 0.9684 1.0440 1.0926 1.0569 0.9745 0.9533 1.0165 1.0891 1.0751 1.0321 0.9114 0.9026 1.0283 0.9928 1.0711 1.0380 , we obtain six mixed pixel sequences; their images are shown in the second
1476
Howard Hua Yang and Shun-ichi Amari
Figure 3: Separation of mixed images. The images in the first row are sources; those in the second row are the mixed images; those in the third row are separated images.
row in Figure 3. The mixed images are highly correlated with the correlation coefficient matrix, Rm =
1.0000 0.9966 0.9988 0.9982 0.9985 0.9968
0.9966 1.0000 0.9963 0.9995 0.9989 0.9989
0.9988 0.9963 1.0000 0.9972 0.9968 0.9950
0.9982 0.9995 0.9972 1.0000 0.9996 0.9993
0.9985 0.9989 0.9968 0.9996 1.0000 0.9996
0.9968 0.9989 0.9950 0.9993 0.9996 1.0000
.
To compute the demixing matrix, we apply the algorithm (A0 ) and use only 20 percent of the mixed data. The separation result is shown in the third row in Figure 3. If the outputs are arranged in the same order as the sources, the
Adaptive Online Learning Algorithms
correlation coefficient matrix of the output is 1.0000 −0.0041 −0.1124 −0.1774 −0.0041 1.0000 0.1469 0.0344 −0.1124 0.1469 1.0000 0.1160 Ro = −0.1774 0.0344 0.1160 1.0000 0.0663 −0.1446 −0.1683 −0.1258 0.1785 −0.0271 −0.0523 0.0946
1477
0.0663 −0.1446 −0.1683 −0.1258 1.0000 −0.1274
0.1785 −0.0271 −0.0523 0.0946 −0.1274 1.0000 .
Applying algorithm (A00 ), we obtain a separation result similar to that in Figure 3. It is not unusual that dependent sources can sometimes be separated by the algorithms (A0 ) and (A00 ) since the MI of the mixtures is usually much higher than the MI of the sources. Applying algorithms (A0 ) or (A00 ), the MI of the transformed mixture is decreased. When it reaches a level similar to that of the sources, the dependent sources can be extracted. In this case, some prior knowledge about sources (e.g., human face and speech) is needed to select the separation results. 7 Conclusion The blind separation algorithm based on the ME approach is concise and often very effective in practice, but it is not yet fully justified. The MMI approach, on the other hand, is based on the contrast function MI, which is well justified. We have studied the relation between the ME and MMI and prove that the ME will not update the demixing matrix in the directions of increasing the cross-talking at the solution points when all sources are zero mean signals. When one of sources has a nonzero mean, the entropy is not maximized in general at the solution points. It is suggested by this result that the mixture model should be reformulated such that all sources have a zero mean in order to use the ME approach. The two basic MMI algorithms (A0 ) and (A00 ) have been derived based on the minimization of the MI of the outputs. The contrast function MI is impractical unless we approximate it. The difficulty is evaluating the marginal entropy of the outputs. The Gram-Charlier expansion and the Edgeworth expansion are applied to estimate the MI. The natural gradient method is used to minimize the estimated MI to obtain the basic algorithm. These algorithms have been tested for separating unknown source signals mixed by a mixing matrix randomly chosen, and the validity of the algorithms has been verified. Because of the approximation error, the MMI algorithms have limitations; they cannot replace the ME. There are cases such as the experiment in Bell and Sejnowski (1995a) using speech signals in which the ME algorithm performs better. Although the ME and the MMI result in very similar algorithms, it is unknown whether the two approaches are equivalent globally. The activation functions in algorithm A and the cumulants in algorithm A0 should be de-
determined adequately. We proposed an adaptive algorithm to estimate these cumulants. It is observed from many simulations that the adaptive online algorithms (A′) and (A″) need fewer samples than the online algorithm (A) (with nonlinear functions such as 2 tanh(x) and x³) to reach separation. The test of algorithms (A′) and (A″) using the image data suggests that even dependent sources can sometimes be separated if some prior knowledge is used to select the separation results. The reason is that the MI of the mixtures is usually higher than that of the sources, and the algorithms (A′) and (A″) can decrease the MI from a high level to a lower one.

Appendix A: Proof of Lemma 1

Let ‖B‖ ≤ ε < 1. Then |I + B|⁻¹ = 1 − tr(B) + O(ε²) and (I + B)⁻¹ = I − B + O(ε²). So
$$\begin{aligned}
p(y;W) &= |I+B|^{-1}\, r\big((I+B)^{-1}y\big) \\
&= \big(1-\operatorname{tr}(B)\big) \prod_{a=1}^{n} r_a\Big(y_a - \sum_c B_{ac} y_c + O(\varepsilon^2)\Big) \\
&= \big(1-\operatorname{tr}(B)\big) \Big[\prod_{a=1}^{n} r_a(y_a) - \sum_a r_a'(y_a) \prod_{b\neq a} r_b(y_b) \sum_c B_{ac} y_c\Big] + O(\varepsilon^2) \\
&= r(y) - \Big[\operatorname{tr}(B) + \sum_{a,c} l_a'(y_a) B_{ac} y_c\Big] r(y) + O(\varepsilon^2),
\end{aligned} \tag{A.1}$$
where $l_a'(y_a) = r_a'(y_a)/r_a(y_a)$.
From equation A.1, we have the linear expansion of C(W) = C((I + B)A⁻¹) at B = 0,
$$C\big((I+B)A^{-1}\big) \approx -\sum_a \int dy\, \Big[\operatorname{tr}(B) + \sum_{b,c} l_b'(y_b) B_{bc} y_c\Big] r(y) \log k_a(y_a),$$
from which we calculate ∂C/∂B at B = 0. When b ≠ c,
$$\begin{aligned}
\frac{\partial C}{\partial B_{bc}} &= -\sum_a \int dy\, y_c\, r(y)\, l_b'(y_b) \log k_a(y_a) \\
&= -\sum_{a\neq c,\, a\neq b} \Big(\int dy_b\, r_b'(y_b)\Big) \Big(\int dy_c\, y_c\, r(y_c)\Big)\, D[r_a(y_a)\,\|\,g_a'(y_a)] \\
&\quad - \int dy\, y_c\, r_c(y_c) \Big(\prod_{i\neq c} r_i(y_i)\Big) l_b'(y_b) \log k_c(y_c) \\
&\quad - \Big(\int dy_c\, y_c\, r(y_c)\Big) \Big(\int dy_b\, r_b'(y_b)\, r_b(y_b) \log k_b(y_b)\Big) \\
&= -\Big(\int dy_c\, y_c\, r(y_c)\Big) \Big(\int dy_b\, r_b'(y_b)\, r_b(y_b) \log k_b(y_b)\Big),
\end{aligned} \tag{A.2}$$
since $\int dy\, r_b'(y) = 0$ and $\big(\prod_{i\neq c} r_i(y_i)\big)\, l_b'(y_b) = \big(\prod_{i\neq c,\, i\neq b} r_i(y_i)\big)\, r_b'(y_b)$.
Appendix B: Natural Gradient

Denote Gl(n) = {X ∈ R^{n×n} : det(X) ≠ 0}. Let W ∈ Gl(n) and let φ(W) be a scalar function. Before we define the natural gradient of φ(W), we recapitulate the gradient in a Riemannian manifold S in tensorial form. Let x = (xⁱ), i = 1, …, m, be its (local) coordinates. The gradient of φ(x) is written as
$$\nabla\varphi = \Big(\frac{\partial\varphi}{\partial x^i}\Big) = (a_i),$$
which is a linear operator (or a covariant vector) mapping a vector (contravariant vector) X = (Xⁱ) to real values in R,
$$\nabla\varphi \circ X = \sum_i a_i X^i.$$
The manifold S has the Riemannian metric g_{ij}(x), by which the inner product of X and Y is written as
$$\langle X, Y\rangle = \sum_{i,j} g_{ij} X^i Y^j.$$
Let ∇̃φ = ã be the contravariant version of a = ∇φ, such that
$$\nabla\varphi \circ X = \langle\tilde\nabla\varphi, X\rangle, \qquad\text{or}\qquad \sum_i a_i X^i = \sum_{i,j} g_{ij}\,\tilde a^i X^j.$$
So we have
$$\tilde a^i = \sum_j g^{ij} a_j,$$
where (g^{ij}) is the inverse matrix of (g_{ij}).

It is possible to give a more direct meaning to ãⁱ. We search for the steepest direction of φ(x) at point x. Let ε > 0 be a small constant, and we study the direction d such that
$$\tilde d = \operatorname{argmax}_d\, \varphi(x + \varepsilon d),$$
under the constraint that d is a unit vector,
$$\sum_{i,j} g_{ij}\, d^i d^j = 1.$$
Then it is easy to calculate that d̃ⁱ = ãⁱ. It is natural to define the gradient flow in S by
$$\dot x = -\eta\,\tilde\nabla\varphi = -\eta \sum_j g^{ij} \frac{\partial}{\partial x^j}\varphi(x).$$
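To make the recapitulation concrete, here is a minimal NumPy sketch of one steepest-descent step under a metric G = (g_ij); the function names and the step size are our own assumptions, not part of the paper. The second function anticipates the Gl(n) result derived below, dW/dt = −η(∇φ)WᵀW:

```python
import numpy as np

def riemannian_gradient_step(x, grad_phi, metric, eta=0.1):
    """One natural-gradient step: x <- x - eta * G^{-1} a, where
    a = grad_phi(x) is the ordinary (covariant) gradient and
    G = metric(x) is the Riemannian metric matrix at x."""
    a = grad_phi(x)
    G = metric(x)
    return x - eta * np.linalg.solve(G, a)

def natural_gradient_step_gl(W, grad_phi, eta=0.1):
    """Natural-gradient step on Gl(n) with the Lie-group-invariant
    metric (derived below): W <- W - eta * grad_phi(W) @ W.T @ W."""
    return W - eta * grad_phi(W) @ W.T @ W
```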
Now we return to our manifold Gl(n) of matrices. We define the natural gradient of φ(W) by imposing a Riemannian structure on Gl(n). The manifold Gl(n) has the Lie group structure: any A ∈ Gl(n) maps Gl(n) to Gl(n) by W → WA, and A = E (the unit matrix) is the unit. We impose that the Riemannian structure should be invariant under this operation by A. More definitely, at W, we consider a small deviation of W, W + εZ, where ε is infinitesimally small. Here, Z is the tangent vector at W (or an element of the Lie algebra). We introduce an inner product ⟨Z, Z⟩_W at W. Now, W is mapped to the unit matrix E by the operation of A = W⁻¹. Then W and W + εZ are mapped to WA = WW⁻¹ = E and (W + εZ)W⁻¹ = E + εZW⁻¹, respectively. So the tangent vector Z at W corresponds to the tangent vector ZW⁻¹ at E. They should have the same length (the Lie group invariance: the same as the invariance under the basis change of vectors on which a matrix acts). So
$$\langle Z, Z\rangle_W = \langle ZW^{-1}, ZW^{-1}\rangle_E.$$
It is very natural to define the inner product at E by
$$\langle Y, Y\rangle_E = \operatorname{tr}(Y^T Y) = \sum_{i,j} (Y_{ij})^2,$$
because we have no specific preference among the components at E. Then we have
$$\langle Z, Z\rangle_W = \langle ZW^{-1}, ZW^{-1}\rangle_E = \operatorname{tr}(W^{-T} Z^T Z W^{-1}).$$
On the other hand, the gradient operator ∂φ is defined by
$$\varphi(W + \varepsilon Z) = \varphi(W) + \varepsilon\,\partial\varphi \circ Z, \qquad \partial\varphi \circ Z = \operatorname{tr}(\partial\varphi^T Z) = \sum_{i,j} \frac{\partial\varphi}{\partial W_{ij}} Z_{ij}.$$
Hence, the contravariant version ∂̃φ satisfies
$$\partial\varphi \circ Z = \langle\tilde\partial\varphi, Z\rangle_W = \operatorname{tr}(W^{-T}\,\tilde\partial\varphi^T Z\, W^{-1}) = \operatorname{tr}(W^{-1} W^{-T}\,\tilde\partial\varphi^T Z),$$
giving
$$\partial\varphi^T = W^{-1} W^{-T}\,\tilde\partial\varphi^T, \qquad\text{or}\qquad \partial\varphi = \tilde\partial\varphi\, W^{-1} W^{-T},$$
that is,
$$\tilde\partial\varphi = \partial\varphi\, W^T W.$$
Hence, the natural gradient descent algorithm should be
$$\frac{dW}{dt} = -\eta\,(\nabla\varphi)\, W^T W.$$
The natural gradient (∇φ)WᵀW is equivalent to the relative gradient introduced in Cardoso and Laheld (1996). Using the natural gradient or the relative gradient leads to blind separation algorithms with the equivariant property; that is, the performance of the algorithms is independent of the scaling of the sources.

Acknowledgments

We thank the reviewers for their constructive comments on the manuscript and acknowledge the useful discussions with J.-F. Cardoso and A. Cichocki.

References

Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Trans. on Electronic Computers, EC-16(3), 299–307.
Amari, S. (1993). Backpropagation and stochastic gradient descent method. Neurocomputing, 5, 185–196.
Amari, S. (1997). Neural learning in structured parameter spaces—natural Riemannian gradient. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Barlow, H. B., & Földiák, P. (1989). Adaptation and decorrelation in the cortex. In C. Miall, R. M. Durbin, & G. J. Mitchison (Eds.), The computing neuron (pp. 54–72). Reading, MA: Addison-Wesley.
Bell, A. J., & Sejnowski, T. J. (1995a). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1995b). Fast blind separation based on information theory. In Proceedings 1995 International Symposium on Nonlinear Theory and Applications (Vol. 1, pp. 43–47). Las Vegas.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.
Cichocki, A., Unbehauen, R., Moszczyński, L., & Rummert, E. (1994). A new online adaptive learning algorithm for blind separation of source signals. In ISANN94 (pp. 406–411). Taiwan.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Deco, G., & Brauer, W. (1996). Nonlinear higher-order statistical decorrelation by volume-conserving neural architectures. Neural Networks, 8(4), 525–535.
Nadal, J. P., & Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 561–581.
Stuart, A., & Ord, J. K. (1994). Kendall's advanced theory of statistics. London: Edward Arnold.
Received August 19, 1996, accepted February 21, 1997.
Communicated by Anthony Bell
A Fast Fixed-Point Algorithm for Independent Component Analysis

Aapo Hyvärinen
Erkki Oja
Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland
We introduce a novel fast algorithm for independent component analysis, which can be used for blind source separation and feature extraction. We show how a neural network learning rule can be transformed into a fixed-point iteration, which provides an algorithm that is very simple, does not depend on any user-defined parameters, and is fast to converge to the most accurate solution allowed by the data. The algorithm finds, one at a time, all nongaussian independent components, regardless of their probability distributions. The computations can be performed in either batch mode or a semiadaptive manner. The convergence of the algorithm is rigorously proved, and the convergence speed is shown to be cubic. Some comparisons to gradient-based algorithms are made, showing that the new algorithm is usually 10 to 100 times faster, sometimes giving the solution in just a few iterations.

1 Introduction

Independent component analysis (ICA) (Comon, 1994; Jutten & Herault, 1991) is a signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent component variables. Two interesting applications of ICA are blind source separation and feature extraction. In the simplest form of ICA (Comon, 1994), we observe m scalar random variables v₁, v₂, …, v_m, which are assumed to be linear combinations of n unknown independent components s₁, s₂, …, s_n that are mutually statistically independent and zero mean. In addition, we must assume n ≤ m. Let us arrange the observed variables v_i into a vector v = (v₁, v₂, …, v_m)ᵀ and the component variables s_i into a vector s, respectively; then the linear relationship is given by
$$v = As. \tag{1.1}$$
Here, A is an unknown m × n matrix of full rank, called the mixing matrix. The basic problem of ICA is then to estimate the original components s_i from the mixtures v_j or, equivalently, to estimate the mixing matrix A.
The fundamental restriction of the model is that we can estimate only nongaussian independent components (except if just one of the independent components is gaussian). Moreover, neither the energies nor the signs of the independent components can be estimated, because any constant multiplying an independent component in equation 1.1 could be cancelled by dividing the corresponding column of the mixing matrix A by the same constant. For mathematical convenience, we define here that the independent components s_i have unit variance. This makes the (nongaussian) independent components unique, up to their signs. Note that no order is defined between the independent components. In blind source separation (Cardoso, 1990; Jutten & Herault, 1991), the observed values of v correspond to a realization of an m-dimensional discrete-time signal v(t), t = 1, 2, …. Then the components s_i(t) are called source signals, which are usually original, uncorrupted signals or noise sources. Another possible application of ICA is feature extraction (Bell & Sejnowski, 1996, in press; Hurri, Hyvärinen, Karhunen, & Oja, 1996). Then the columns of A represent features, and s_i signals the presence and the amplitude of the ith feature in the observed data v. The problem of estimating the matrix A in equation 1.1 can be somewhat simplified by performing a preliminary sphering or prewhitening of the data v (Cardoso, 1990; Comon, 1994; Oja & Karhunen, 1995). The observed vector v is linearly transformed to a vector x = Mv such that its elements x_i are mutually uncorrelated, and all have unit variance. Thus the correlation matrix of x equals unity: E{xxᵀ} = I. This transformation is always possible and can be accomplished by classical principal component analysis. At the same time, the dimensionality of the data should be reduced so that the dimension of the transformed data vector x equals n, the number of independent components. This also has the effect of reducing noise. After the transformation we have x = Mv = MAs = Bs,
(1.2)
where B = MA is an orthogonal matrix due to our assumptions on the components si : it holds E{xxT } = BE{ssT }BT = BBT = I. Thus we have reduced the problem of finding an arbitrary full-rank matrix A to the simpler problem of finding an orthogonal matrix B, which then gives s = BT x. If the ith column of B is denoted bi , then the ith independent component can be computed from the observed x as si = (bi )T x. The current algorithms for ICA can be roughly divided into two categories. The algorithms in the first category (Cardoso, 1992; Comon, 1994) rely on batch computations minimizing or maximizing some relevant criterion functions. The problem with these algorithms is that they require very complex matrix or tensorial operations. The second category contains adaptive algorithms often based on stochastic gradient methods, which may have implementations in neural networks (Amari, Cichocki, & Yang, 1996;
Bell & Sejnowski, 1995; Delfosse & Loubaton, 1995; Hyvärinen & Oja, 1996; Jutten & Herault, 1991; Moreau & Macchi, 1993; Oja & Karhunen, 1995). The main problem with this category is the slow convergence and the fact that the convergence depends crucially on the correct choice of the learning rate parameters.

In this article we introduce a novel approach for performing the computations needed in ICA.¹ We introduce an algorithm using a very simple yet highly efficient fixed-point iteration scheme for finding the local extrema of the kurtosis of a linear combination of the observed variables. It is well known (Delfosse & Loubaton, 1995) that finding the local extrema of kurtosis is equivalent to estimating the nongaussian independent components. However, the convergence of our algorithm will be proved independent of these well-known results. The computations can be performed either in batch mode or semiadaptively. The new algorithm is introduced and analyzed in section 3, after presenting in section 2 a short review of kurtosis minimization-maximization and its relation to neural network type learning rules.

2 ICA by Kurtosis Minimization and Maximization

Most suggested solutions to the ICA problem use the fourth-order cumulant or kurtosis of the signals, defined for a zero-mean random variable v as
$$\operatorname{kurt}(v) = E\{v^4\} - 3\big(E\{v^2\}\big)^2. \tag{2.1}$$
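To make the definition concrete, the following minimal NumPy sketch (a hypothetical helper, not part of the paper) estimates equation 2.1 from samples and illustrates the sign behavior described next:

```python
import numpy as np

def kurt(v):
    """Sample estimate of kurt(v) = E{v^4} - 3 (E{v^2})^2
    for a zero-mean random variable v (equation 2.1)."""
    v = np.asarray(v, dtype=float)
    return np.mean(v**4) - 3.0 * np.mean(v**2) ** 2

rng = np.random.default_rng(0)
print(kurt(rng.standard_normal(100_000)))   # ~ 0  (gaussian)
print(kurt(rng.uniform(-1, 1, 100_000)))    # < 0  (flatter density)
print(kurt(rng.laplace(0, 1, 100_000)))     # > 0  (peaked density)
```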
For a gaussian random variable, kurtosis is zero; for densities peaked at zero, it is positive, and for flatter densities, negative. Note that for two independent random variables v₁ and v₂ and for a scalar α, it holds kurt(v₁ + v₂) = kurt(v₁) + kurt(v₂) and kurt(αv₁) = α⁴ kurt(v₁). Let us search for a linear combination of the sphered observations x_i, say, wᵀx, such that it has maximal or minimal kurtosis. Obviously, this is meaningful only if the norm of w is somehow bounded; let us assume ‖w‖ = 1. Using the orthogonal mixing matrix B, let us define z = Bᵀw. Then also ‖z‖ = 1. Using equation 1.2 and the properties of the kurtosis, we have
$$\operatorname{kurt}(w^T x) = \operatorname{kurt}(w^T B s) = \operatorname{kurt}(z^T s) = \sum_{i=1}^{n} z_i^4 \operatorname{kurt}(s_i). \tag{2.2}$$
¹ It was brought to our attention that a similar algorithm for blind deconvolution was proposed by Shalvi and Weinstein (1993).

Under the constraint ‖w‖ = ‖z‖ = 1, the function in equation 2.2 has a number of local minima and maxima. For simplicity, let us assume for the moment that the mixture contains at least one independent component
whose kurtosis is negative and at least one whose kurtosis is positive. Then, as was shown by Delfosse and Loubaton (1995), the extremal points of equation 2.2 are the canonical base vectors z = ±e_j, that is, vectors whose components are all zero except one component that equals ±1. The corresponding weight vectors are w = Bz = Be_j = b_j (perhaps with a minus sign), i.e., the columns of the orthogonal mixing matrix B. So by minimizing or maximizing the kurtosis in equation 2.2 under the given constraint, the columns of the mixing matrix are obtained as solutions for w, and the linear combination itself will be one of the independent components: wᵀx = (b_i)ᵀx = s_i. Equation 2.2 also shows that gaussian components cannot be estimated in this way, because for them kurt(s_i) is zero.

To minimize or maximize kurt(wᵀx), a neural algorithm based on gradient descent or ascent can be used (Delfosse & Loubaton, 1995; Hyvärinen & Oja, 1996). Then w is interpreted as the weight vector of a neuron with input vector x. The objective function can be simplified because the inputs have been sphered. It holds
$$\operatorname{kurt}(w^T x) = E\{(w^T x)^4\} - 3\big[E\{(w^T x)^2\}\big]^2 = E\{(w^T x)^4\} - 3\|w\|^4. \tag{2.3}$$
Also the constraint ‖w‖ = 1 must be taken into account, for example, by a penalty term (Hyvärinen & Oja, 1996). Then the final objective function is
$$J(w) = E\{(w^T x)^4\} - 3\|w\|^4 + F(\|w\|^2), \tag{2.4}$$
where F is a penalty term due to the constraint. Several forms for the penalty term have been suggested by Hyvärinen and Oja (1996).² In the following, the exact form of F is not important. Denoting by x(t) the sequence of observations, by µ(t) the learning rate sequence, and by f the derivative of F/2, the online learning algorithm then has the form
$$w(t+1) = w(t) \pm \mu(t)\big[x(t)\,(w(t)^T x(t))^3 - 3\|w(t)\|^2 w(t) + f(\|w(t)\|^2)\, w(t)\big]. \tag{2.5}$$
The first two terms in brackets are obtained from the gradient of kurt(wᵀx) when instantaneous values are used instead of the expectation. The third term in brackets is obtained from the gradient of F(‖w‖²); as long as this is a function of ‖w‖² only, its gradient has the form scalar × w. A positive sign before the brackets means finding the local maxima; a negative sign corresponds to local minima. The convergence of this kind of algorithm can be proved using the principles of stochastic approximation.

² Note that in Hyvärinen and Oja (1996), the second term in J was also included in the penalty term.

The advantage of such neural learning
rules is that the inputs x(t) can be used in the algorithm at once, thus enabling fast adaptation in a nonstationary environment. A resulting trade-off is that the convergence is slow and depends on a good choice of the learning rate sequence µ(t). A bad choice of the learning rate can, in practice, destroy convergence. Therefore, some ways to make the learning radically faster and more reliable may be needed. The fixed-point iteration algorithms are such an alternative. The fixed points w of the learning rule (see equation 2.5) are obtained by taking the expectations and equating the change in the weight to 0:
$$E\{x\,(w^T x)^3\} - 3\|w\|^2 w + f(\|w\|^2)\, w = 0. \tag{2.6}$$
The time index t has been dropped. A deterministic iteration could be formed from equation 2.6 in a number of ways, for example, by standard numerical algorithms for solving such equations. A very fast iteration is obtained, as shown in the next section, if we write equation 2.6 in the form
$$w = \text{scalar} \times \big(E\{x\,(w^T x)^3\} - 3\|w\|^2 w\big). \tag{2.7}$$
Actually, because the norm of w is irrelevant, it is the direction of the right-hand side that is important. Therefore the scalar in equation 2.7 is not significant, and its effect can be replaced by explicit normalization, or the projection of w onto the unit sphere.

3 Fixed-Point Algorithm

3.1 The Algorithm.

3.1.1 Estimating One Independent Component. Assume that we have collected a sample of the sphered (or prewhitened) random vector x, which in the case of blind source separation is a collection of linear mixtures of independent source signals according to equation 1.2. Using the derivation of the preceding section, we get the following fixed-point algorithm for ICA:

1. Take a random initial vector w(0) of norm 1. Let k = 1.
2. Let w(k) = E{x (w(k − 1)ᵀx)³} − 3w(k − 1). The expectation can be estimated using a large sample of x vectors (say, 1000 points).
3. Divide w(k) by its norm.
4. If |w(k)ᵀw(k − 1)| is not close enough to 1, let k = k + 1, and go back to step 2. Otherwise, output the vector w(k).

The final vector w(k) given by the algorithm equals one of the columns of the (orthogonal) mixing matrix B. In the case of blind source separation, this means that w(k) separates one of the nongaussian source signals in the sense that w(k)ᵀx(t), t = 1, 2, …, equals one of the source signals.
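The following minimal NumPy sketch implements steps 1 through 4; the variable names, the tolerance `tol`, and the iteration cap are our own choices, not the authors' notation. It assumes the data have already been sphered:

```python
import numpy as np

def fixed_point_one_unit(x, tol=1e-6, max_iter=100, seed=0):
    """One-unit fixed-point iteration for ICA (steps 1-4 above).

    x: sphered observations, shape (n, T).
    Returns w approximating one column of the orthogonal matrix B."""
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)                   # step 1: random w(0)
    w /= np.linalg.norm(w)                       # of norm 1
    for _ in range(max_iter):
        w_old = w
        wx = w @ x                               # projections w^T x
        w = (x * wx**3).mean(axis=1) - 3.0 * w_old   # step 2
        w /= np.linalg.norm(w)                   # step 3
        if abs(w @ w_old) > 1.0 - tol:           # step 4: converged?
            break
    return w
```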
A remarkable property of our algorithm is that a very small number of iterations, usually 5 to 10, seems to be enough to obtain the maximal accuracy allowed by the sample data. This is due to the cubic convergence shown below.

3.1.2 Estimating Several Independent Components. To estimate n independent components, we run this algorithm n times. To ensure that we estimate each time a different independent component, we need only to add a simple orthogonalizing projection inside the loop. Recall that the columns of the mixing matrix B are orthonormal because of the sphering. Thus we can estimate the independent components one by one by projecting the current solution w(k) on the space orthogonal to the columns of the mixing matrix B previously found. Define the matrix B̄ as the matrix whose columns are the previously found columns of B. Then add the projection operation at the beginning of step 3:

Let w(k) = w(k) − B̄B̄ᵀw(k). Divide w(k) by its norm.

Also the initial random vector should be projected this way before starting the iterations. To prevent estimation errors in B̄ from deteriorating the estimate w(k), this projection step can be omitted after the first few iterations. Once the solution w(k) has entered the basin of attraction of one of the fixed points, it will stay there and converge to that fixed point.

In addition to the hierarchical (or sequential) orthogonalization described above, any other method of orthogonalizing the weight vectors could also be used. In some applications, a symmetric orthogonalization might be useful. This means that the fixed-point step is first performed for all the n weight vectors, and then the matrix W(k) = (w₁(k), …, w_n(k)) of the weight vectors is orthogonalized, for example, using the well-known formula,
$$W(k) = W(k)\big(W(k)^T W(k)\big)^{-1/2}, \tag{3.1}$$
where (W(k)ᵀW(k))⁻¹ᐟ² is obtained from the eigenvalue decomposition of W(k)ᵀW(k) = EDEᵀ as (W(k)ᵀW(k))⁻¹ᐟ² = ED⁻¹ᐟ²Eᵀ (both orthogonalizations are sketched in code below). However, the convergence proof below applies only to the hierarchical orthogonalization.

3.1.3 A Semiadaptive Version. A disadvantage of many batch algorithms is that large amounts of data must be stored simultaneously in working memory. Our fixed-point algorithm, however, can be used in a semiadaptive manner so as to avoid this problem. This can be accomplished simply by computing the expectation E{x (w(k − 1)ᵀx)³} by an online algorithm for N consecutive sample points, keeping w(k − 1) fixed, and updating the vector w(k) after the average over all the N sample points has been computed. This semiadaptive version also makes adaptation to nonstationary data possible. Thus, the semiadaptive algorithm combines many of the advantages usually attributed to either online or batch algorithms.
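Here is a minimal NumPy sketch of both orthogonalization options from section 3.1.2; the helper names are our own. The first implements the projection added to step 3, the second implements equation 3.1 via the eigendecomposition just described:

```python
import numpy as np

def deflation_project(w, B_found):
    """Hierarchical orthogonalization: w <- w - B_bar B_bar^T w,
    then renormalize. B_found holds the found columns of B."""
    if B_found.size:
        w = w - B_found @ (B_found.T @ w)
    return w / np.linalg.norm(w)

def symmetric_orthogonalize(W):
    """Symmetric orthogonalization W <- W (W^T W)^{-1/2}, using
    W^T W = E D E^T and (W^T W)^{-1/2} = E D^{-1/2} E^T."""
    d, E = np.linalg.eigh(W.T @ W)
    return W @ (E @ np.diag(d**-0.5) @ E.T)
```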
3.2 Convergence Proof. Now we prove the convergence of our algorithm. To begin, make the change of variables z(k) = Bᵀw(k). Note that the effect of the projection step is to set to zero all components z_i(k) of z(k) such that the ith column of B has already been estimated. Therefore, we can simply consider in the following the algorithm without the projection step, taking into account that the z_i(k) corresponding to the independent components previously estimated are zero. First, using equation 2.2, we get the following form for step 2 of the algorithm:
$$z(k) = E\{s\,(z(k-1)^T s)^3\} - 3z(k-1). \tag{3.2}$$
Expanding the first term, we can calculate explicitly the expectation, and obtain for the ith component of the vector z(k),
$$z_i(k) = E\{s_i^4\}\, z_i(k-1)^3 + 3\sum_{j\neq i} z_j(k-1)^2\, z_i(k-1) - 3z_i(k-1), \tag{3.3}$$
where we have used the fact that by the statistical independence of the s_i, we have E{s_i²s_j²} = 1, and E{s_i³s_j} = E{s_i²s_js_l} = E{s_is_js_ls_m} = 0 for four different indices i, j, l, and m. Using ‖z(k)‖ = ‖w(k)‖ = 1, equation 3.3 simplifies to
$$z_i(k) = \operatorname{kurt}(s_i)\, z_i(k-1)^3, \tag{3.4}$$
where kurt(s_i) = E{s_i⁴} − 3 is the kurtosis of the ith independent component. Note that the subtraction of 3w(k − 1) from the right side cancelled the term due to the cross-variances, enabling direct access to the fourth-order cumulants. Choosing j so that kurt(s_j) ≠ 0 and z_j(k − 1) ≠ 0, we further obtain
$$\frac{|z_i(k)|}{|z_j(k)|} = \frac{|\operatorname{kurt}(s_i)|}{|\operatorname{kurt}(s_j)|} \left(\frac{|z_i(k-1)|}{|z_j(k-1)|}\right)^3. \tag{3.5}$$
Note that the assumption z_j(k − 1) ≠ 0 implies that the jth column of B is not among those already found. Next note that |z_i(k)|/|z_j(k)| is not changed by the normalization step 3. It is therefore possible to solve explicitly for |z_i(k)|/|z_j(k)| from this recursive formula, which yields
$$\frac{|z_i(k)|}{|z_j(k)|} = \frac{\sqrt{|\operatorname{kurt}(s_j)|}}{\sqrt{|\operatorname{kurt}(s_i)|}} \left(\frac{\sqrt{|\operatorname{kurt}(s_i)|}\;|z_i(0)|}{\sqrt{|\operatorname{kurt}(s_j)|}\;|z_j(0)|}\right)^{3^k} \tag{3.6}$$
for all k > 0. For $j = \operatorname{argmax}_p \sqrt{|\operatorname{kurt}(s_p)|}\,|z_p(0)|$, we see that all the other components z_i(k), i ≠ j, quickly become small compared to z_j(k). Taking the normalization ‖z(k)‖ = ‖w(k)‖ = 1 into account, this means that z_j(k) → 1
and z_i(k) → 0 for all i ≠ j. This implies that w(k) = Bz(k) converges to the column b_j of the mixing matrix B, for which the kurtosis of the corresponding independent component s_j is not zero and which has not yet been found. This proves the convergence of our algorithm.

4 Simulation Results

The fixed-point algorithm was applied to blind separation of four source signals from four observed mixtures. Two of the source signals had a uniform distribution, and the other two were obtained as cubes of gaussian variables. Thus, the source signals included both subgaussian and supergaussian signals. Using different random mixing matrices and initial values w(0), five iterations were usually sufficient for estimation of one column of the orthogonal mixing matrix to an accuracy of four decimal places.

Next, the convergence speed of our algorithm was compared with the speed of the corresponding neural stochastic gradient algorithm. In the neural algorithm, an empirically optimized learning rate sequence was used. The computational overhead of the fixed-point algorithm was optimized by initially using a small sample size (200 points) in step 2 of the algorithm, and increasing it at every iteration for greater accuracy. The number of floating-point operations was calculated for both methods. The fixed-point algorithm needed only 10 percent of the floating-point operations required by the neural algorithm. This result was achieved with an empirically optimized learning rate. If we had been obliged to choose the learning rate without preliminary testing, the speed-up factor would have been of the order of 100. In fact, the neural stochastic gradient algorithm might not have converged at all. These simulation results confirm the theoretical implications of very fast convergence.

5 Discussion

We introduced a batch version of a neural learning algorithm for ICA. This algorithm, which is based on the fixed-point method, has several advantages as compared to other suggested ICA methods.

1. Equation 3.6 shows that the convergence of our algorithm is cubic. This means very fast convergence and is rather unique among the ICA algorithms. It is also in contrast to other similar fixed-point algorithms, like the power method, which often have only linear convergence. In fact, our algorithm can be considered a higher-order generalization of the power method for tensors.

2. Contrary to gradient-based algorithms, there are no learning rates or other adjustable parameters in the algorithm, which makes it easy to use and more reliable.
3. The algorithm, in its hierarchical version, finds the independent components one at a time, instead of working in parallel like most of the suggested ICA algorithms that solve for the entire mixing matrix. This makes it possible to estimate only certain desired independent components, provided we have sufficient prior information about the weight matrices corresponding to those components. For example, if the initial weight vector of the algorithm is the first principal component, the algorithm probably finds the most important independent component, that is, the one for which the norm of the corresponding column in the original mixing matrix A is the largest.

4. Components of both negative kurtosis (i.e., subgaussian components) and positive kurtosis (i.e., supergaussian components) can be found by starting the algorithm from different initial points and possibly removing the effect of the already-found independent components by a projection onto an orthogonal subspace. If just one of the independent components is gaussian (or otherwise has zero kurtosis), it can be estimated as the residual that is left over after extracting all other independent components. Recall that more than one gaussian independent component cannot be estimated in the ICA model.

5. Although the algorithm was motivated as a short-cut method to make neural learning for kurtosis minimization and maximization faster, its convergence was proved independently of the neural algorithm and the well-known results on the connection between ICA and kurtosis. Indeed, our proof is based on principles different from those used so far in ICA and thus opens new lines for research.

Recent developments of the theory presented in this article can be found in Hyvärinen (1997a,b).

References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind source separation. In Advances in Neural Information Processing 8 (Proc. NIPS'95). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A., & Sejnowski, T. (1996). Learning higher-order structure of a natural sound. Network, 7, 261–266.
Bell, A., & Sejnowski, T. (in press). The "independent components" of natural scenes are edge filters. Vision Research.
Cardoso, J.-F. (1990). Eigen-structure of the fourth-order cumulant tensor with application to the blind source separation problem. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (pp. 2655–2658). Albuquerque, NM.
Cardoso, J.-F. (1992). Iterative techniques for blind source separation using only fourth-order cumulants. In Proc. EUSIPCO (pp. 739–742). Brussels.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Hurri, J., Hyvärinen, A., Karhunen, J., & Oja, E. (1996). Image feature extraction using independent component analysis. In Proc. NORSIG'96. Espoo, Finland.
Hyvärinen, A. (1997a). A family of fixed-point algorithms for independent component analysis. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP '97) (pp. 3917–3920). Munich, Germany.
Hyvärinen, A. (1997b). Independent component analysis by minimization of mutual information (Technical Report). Helsinki: Helsinki University of Technology, Laboratory of Computer and Information Science.
Hyvärinen, A., & Oja, E. (1996). A neuron that learns to separate one independent component from linear mixtures. In Proc. IEEE Int. Conf. on Neural Networks (pp. 62–67). Washington, D.C.
Jutten, C., & Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Moreau, E., & Macchi, O. (1993). New self-adaptive algorithms for source separation based on contrast functions. In Proc. IEEE Signal Processing Workshop on Higher Order Statistics (pp. 215–219). Lake Tahoe, NV.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami, Y. Attikiouzel, R. Marks, D. Fogel, & T. Fukuda (Eds.), Computational intelligence—a dynamic system perspective (pp. 83–97). New York: IEEE Press.
Shalvi, O., & Weinstein, E. (1993). Super-exponential methods for blind deconvolution. IEEE Trans. on Information Theory, 39(2), 504–519.
Received April 8, 1996, accepted December 17, 1996.
Communicated by Garrison Cottrell
Dimension Reduction by Local Principal Component Analysis

Nandakishore Kambhatla
Todd K. Leen
Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, Portland, Oregon 97291-1000, U.S.A.
Reducing or eliminating statistical redundancy between the components of high-dimensional vector data enables a lower-dimensional representation without significant loss of information. Recognizing the limitations of principal component analysis (PCA), researchers in the statistics and neural network communities have developed nonlinear extensions of PCA. This article develops a local linear approach to dimension reduction that provides accurate representations and is fast to compute. We exercise the algorithms on speech and image data, and compare performance with PCA and with neural network implementations of nonlinear PCA. We find that both nonlinear techniques can provide more accurate representations than PCA and show that the local linear techniques outperform neural network implementations.

1 Introduction

The objective of dimension reduction algorithms is to obtain a parsimonious description of multivariate data. The goal is to obtain a compact, accurate representation of the data that reduces or eliminates statistically redundant components. Dimension reduction is central to an array of data processing goals. Input selection for classification and regression problems is a task-specific form of dimension reduction. Visualization of high-dimensional data requires mapping to a lower dimension—usually three or fewer. Transform coding (Gersho & Gray, 1992) typically involves dimension reduction. The initial high-dimensional signal (e.g., image blocks) is first transformed in order to reduce statistical dependence, and hence redundancy, between the components. The transformed components are then scalar quantized. Dimension reduction can be imposed explicitly, by eliminating a subset of the transformed components. Alternatively, the allocation of quantization bits among the transformed components (e.g., in increasing measure according to their variance) can result in eliminating the low-variance components by assigning them zero bits. Recently several authors have used neural network implementations of dimension reduction to signal equipment failures by novelty, or outlier,
detection (Petsche et al., 1996; Japkowicz, Myers, & Gluck, 1995). In these schemes, high-dimensional sensor signals are projected onto a subspace that best describes the signals obtained during normal operation of the monitored system. New signals are categorized as normal or abnormal according to the distance between the signal and its projection.¹

The classic technique for linear dimension reduction is principal component analysis (PCA). In PCA, one performs an orthogonal transformation to the basis of correlation eigenvectors and projects onto the subspace spanned by those eigenvectors corresponding to the largest eigenvalues. This transformation decorrelates the signal components, and the projection along the high-variance directions maximizes variance and minimizes the average squared residual between the original signal and its dimension-reduced approximation. A neural network implementation of one-dimensional PCA implemented by Hebb learning was introduced by Oja (1982) and expanded to hierarchical, multidimensional PCA by Sanger (1989), Kung and Diemantaras (1990), and Rubner and Tavan (1989). A fully parallel (nonhierarchical) design that extracts orthogonal vectors spanning an m-dimensional PCA subspace was given by Oja (1989). Concurrently, Baldi and Hornik (1989) showed that the error surface for linear, three-layer autoassociators with hidden layers of width m has global minima corresponding to input weights that span the m-dimensional PCA subspace.

Despite its widespread use, the PCA transformation is crippled by its reliance on second-order statistics. Though uncorrelated, the principal components can be highly statistically dependent. When this is the case, PCA fails to find the most compact description of the data. Geometrically, PCA models the data as a hyperplane embedded in the ambient space. If the data components have nonlinear dependencies, PCA will require a larger-dimensional representation than would be found by a nonlinear technique. This simple realization has prompted the development of nonlinear alternatives to PCA.

Hastie (1984) and Hastie and Stuetzle (1989) introduce their principal curves as a nonlinear alternative to one-dimensional PCA. Their parameterized curves f(λ): R → Rⁿ are constructed to satisfy a self-consistency requirement. Each data point x is projected to the closest point on the curve, λ_f(x) = argmin_µ ‖x − f(µ)‖, and the expectation of all data points that project to the same parameter value λ is required to be on the curve. Thus f(Λ) = E_x[x | λ_f(x) = Λ]. This mathematical statement reflects the desire that the principal curves pass through the middle of the data.
¹ These schemes also use a sigmoidal contraction map following the projection so that new signals that are close to the subspace, yet far from the training data used to construct the subspace, can be properly tagged as outliers.
Hastie and Stuetzle (1989) prove that the curve f(λ) is a principal curve iff it is a critical point (with respect to variations in f(λ)) of the mean squared distance between the data and their projection onto the curve. They also show that if the principal curves are lines, they correspond to PCA. Finally, they generalize their definitions from principal curves to principal surfaces.

Neural network approximators for principal surfaces are realized by five-layer autoassociative networks. Independent of Hastie and Stuetzle's work, several researchers (Kramer, 1991; Oja, 1991; DeMers & Cottrell, 1993; Usui, Nakauchi, & Nakano, 1991) have suggested such networks for nonlinear dimension reduction. These networks have linear first (input) and fifth (output) layers, and sigmoidal nonlinearities on the second- and fourth-layer nodes. The input and output layers have width n. The third layer, which carries the dimension-reduced representation, has width m < n. We will refer to this layer as the representation layer. Researchers have used both linear and sigmoidal response functions for the representation layer. Here we consider only linear response in the representation layer. The networks are trained to minimize the mean squared distance between the input and output and, because of the middle-layer bottleneck, build a dimension-reduced representation of the data. In view of Hastie and Stuetzle's critical point theorem, and the mean square error (MSE) training criteria for five-layer nets, these networks can be viewed as approximators of principal surfaces.²

Recently Hecht-Nielsen (1995) extended the application of five-layer autoassociators from dimension reduction to encoding. The third layer of his replicator networks has a staircase nonlinearity. In the limit of infinitely steep steps, one obtains a quantization of the middle-layer activity, and hence a discrete encoding of the input signal.³

In this article, we propose a locally linear approach to nonlinear dimension reduction that is much faster to train than five-layer autoassociators and, in our experience, provides superior solutions. Like five-layer autoassociators, the algorithm attempts to minimize the MSE between the original
² There is, as several researchers have pointed out, a fundamental difference between the representation constructed by Hastie's principal surfaces and the representation constructed by five-layer autoassociators. Specifically, autoassociators provide a continuous parameterization of the embedded manifold, whereas the principal surfaces algorithm does not constrain the parameterization to be continuous.
³ Perfectly sharp staircase hidden-layer activations are, of course, not trainable by gradient methods, and the plateaus of a rounded staircase will diminish the gradient signal available for training. However, with parameterized hidden unit activations h(x; a) with h(x; 0) = x and h(x; 1) a sigmoid staircase, one can envision starting training with linear activations and gradually shifting toward a sharp (but finite-sloped) staircase, thereby obtaining an approximation to Hecht-Nielsen's replicator networks. The changing activation function will induce both smooth changes and bifurcations in the set of cost function minima. Practical and theoretical issues in such homotopy methods are discussed in Yang and Yu (1993) and Coetzee and Stonick (1995).
data and its reconstruction from a low-dimensional representation—what we refer to as the reconstruction error. However, while five-layer networks attempt to find a global, smooth manifold that lies close to the data, our algorithm finds a set of hyperplanes that lie close to the data. In a loose sense, these hyperplanes locally approximate the manifold found by five-layer networks. Our algorithm first partitions the data space into disjoint regions by vector quantization (VQ) and then performs local PCA about each cluster center. For brevity, we refer to the hybrid algorithms as VQPCA. We introduce a novel VQ distortion function that optimizes the clustering for the reconstruction error. After training, dimension reduction proceeds by first assigning a datum to a cluster and then projecting onto the m-dimensional principal subspace belonging to that cluster. The encoding thus consists of the index of the cluster and the local, dimension-reduced coordinates within the cluster. The resulting algorithm directly minimizes (up to local optima) the reconstruction error. In this sense, it is optimal for dimension reduction. The computation of the partition is a generalized Lloyd algorithm (Gray, 1984) for a vector quantizer that uses the reconstruction error as the VQ distortion function. The Lloyd algorithm iteratively computes the partition based on two criteria that ensure minimum average distortion: (1) VQ centers lie at the generalized centroids of the quantizer regions and (2) the VQ region boundaries are surfaces that are equidistant (in terms of the distortion function) from adjacent VQ centers. The application of these criteria is considered in detail in section 3.2. The primary point is that constructing the partition according to these criteria provides a (local) minimum for the reconstruction error. Training the local linear algorithm is far faster than training five-layer autoassociators. The clustering operation is the computational bottleneck, though it can be accelerated using the usual tree-structured or multistage VQ. When encoding new data, the clustering operation is again the main consumer of computation, and the local linear algorithm is somewhat slower for encoding than five-layer autoassociators. However, decoding is much faster than for the autoassociator and is comparable to PCA. Local PCA has been previously used for exploratory data analysis (Fukunaga & Olsen, 1971) and to identify the intrinsic dimension of chaotic signals (Broomhead, 1991; Hediger, Passamante, & Farrell, 1990). Bregler and Omohundro (1995) use local PCA to find image features and interpolate image sequences. Hinton, Revow, and Dayan (1995) use an algorithm based on local PCA for handwritten character recognition. Independent of our work, Dony and Haykin (1995) explored a local linear approach to transform coding, though the distortion function used in their clustering is not optimized for the projection, as it is here. Finally, local linear methods for regression have been explored quite thoroughly. See, for example, the LOESS algorithm in Cleveland and Devlin (1988).
In the remainder of the article, we review the structure and function of autoassociators, introduce the local linear algorithm, and present experimental results applying the algorithms to speech and image data.

2 Dimension Reduction and Autoassociators

As considered here, dimension reduction algorithms consist of a pair of maps g: Rⁿ → Rᵐ with m < n and f: Rᵐ → Rⁿ. The function g(x) maps the original n-dimensional vector x to the dimension-reduced vector y ∈ Rᵐ. The function f(y) maps back to the high-dimensional space. The two maps, g and f, correspond to the encoding and decoding operation, respectively. If these maps are smooth on open sets, then they are diffeomorphic to a canonical projection and immersion, respectively. In general, the projection loses some of the information in the original representation, and f(g(x)) ≠ x. The quality of the algorithm, and adequacy of the chosen target dimension m, are measured by the algorithm's ability to reconstruct the original data. A convenient measure, and the one we employ here, is the average squared residual, or reconstruction error:
$$\mathcal{E} = E_x\big[\,\|x - f(g(x))\|^2\,\big].$$
The variance in the original vectors provides a useful normalization scale for the squared residuals, so our experimental results are quoted as normalized reconstruction error:
$$\mathcal{E}_{\mathrm{norm}} = \frac{E_x\big[\,\|x - f(g(x))\|^2\,\big]}{E_x\big[\,\|x - E_x x\|^2\,\big]}. \tag{2.1}$$
This may be regarded as the noise-to-signal ratio for the dimension reduction, with the signal strength defined as its variance. Neural network implementations of these maps are provided by autoassociators, layered, feedforward networks with equal input and output dimension n. During training, the output targets are set to the inputs; thus, autoassociators are sometimes called self-supervised networks. When used to perform dimension reduction, the networks have a hidden layer of width m < n. This hidden, or representation, layer is where the low-dimensional representation is extracted. In terms of the maps defined above, processing from the input to the representation layer corresponds to the projection g, and processing from the representation to the output layer corresponds to the immersion f.

2.1 Three-Layer Autoassociators. Three-layer autoassociators with n input and output units and m < n hidden units, trained to perform the identity transformation over the input data, were used for dimension reduction by several researchers (Cottrell, Munro, & Zipser, 1987; Cottrell &
Metcalfe, 1991; Golomb, Lawrence, & Sejnowski, 1991). In these networks, the first layer of weights performs the projection g and the second the immersion f . The output f (g(x)) is trained to match the input x in the mean square sense. Since the transformation from the hidden to the output layer is linear, the network outputs lie in an m-dimensional linear subspace of Rn . The hidden unit activations define coordinates on this hyperplane. If the hidden unit activation functions are linear, it is obvious that the best one can do to minimize the reconstruction error is to have the hyperplane correspond to the m-dimensional principal subspace. Even if the hidden-layer activation functions are nonlinear, the output is still an embedded hyperplane. The nonlinearities introduce nonlinear scaling of the coordinate intervals on the hyperplane. Again, the solution that minimizes the reconstruction error is the m-dimensional principal subspace. These observations were formalized in two early articles. Baldi and Hornik (1989) show that optimally trained three-layer linear autoassociators perform an orthogonal projection of the data onto the m-dimensional principal subspace. Bourlard and Kamp (1988) show that adding nonlinearities in the hidden layer cannot reduce the reconstruction error. The optimal solution remains the PCA projection.
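For reference, the PCA solution that these networks converge to can be computed directly. The following minimal NumPy sketch (our own helper, assuming one datum per row) projects onto the m-dimensional principal subspace and reports the normalized reconstruction error of equation 2.1:

```python
import numpy as np

def pca_normalized_error(X, m):
    """Encode/decode via the m leading principal directions and
    return E_norm from equation 2.1. X has shape (N, n)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = evecs[:, ::-1][:, :m]       # leading m eigenvectors
    Z = Xc @ V                      # projection g(x)
    Xhat = mu + Z @ V.T             # reconstruction f(g(x))
    return np.sum((X - Xhat)**2) / np.sum(Xc**2)
```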
2.2 Five-Layer Autoassociators. To overcome the PCA limitation of three-layer autoassociators and provide the capability for genuine nonlinear dimension reduction, several authors have proposed five-layer autoassociators (e.g., Oja, 1991; Usui et al., 1991; Kramer, 1991; DeMers & Cottrell, 1993; Kambhatla & Leen, 1994; Hecht-Nielsen, 1995). These networks have sigmoidal second and fourth layers and linear first and fifth layers. The third (representation) layer may have linear or sigmoidal response. Here we use linear response functions. The first two layers of weights carry out a nonlinear projection g: Rn → Rm , and the last two layers of weights carry out a nonlinear immersion f : Rm → Rn (see Figure 1). By the universal approximation theorem for single hidden-layer sigmoidal nets (Funahashi, 1989; Hornik, Stinchcombe, & White, 1989), any continuous composition of immersion and projection (on compact domain) can be approximated arbitrarily closely by the structure. The activities of the nodes in the third, or representation layer form global curvilinear coordinates on a submanifold of the input space (see Figure 1b). We thus refer to five-layer autoassociative networks as a global, nonlinear dimension reduction technique. Several authors report successful implementation of nonlinear PCA using these networks for image (DeMers & Cottrell, 1993; Hecht-Nielsen, 1995; Kambhatla & Leen, 1994) and speech dimension reduction, for characterizing chemical dynamics (Kramer, 1991), and for obtaining concise representations of color (Usui et al., 1991).
Figure 1: (a) Five-layer feedforward autoassociative network with inputs x ∈ Rⁿ and representation layer of dimension m. Outputs x′ ∈ Rⁿ are trained to match the inputs, that is, to minimize E[‖x − x′‖²]. (b) Global curvilinear coordinates built by a five-layer network for data distributed on the surface of a hemisphere. When the activations of the representation layer are swept, the outputs trace out the curvilinear coordinates shown by the solid lines.
3 Local Linear Transforms

Although five-layer autoassociators are convenient and elegant approximators for principal surfaces, they suffer from practical drawbacks. Networks with multiple sigmoidal hidden layers tend to have poorly conditioned Hessians (see Rognvaldsson, 1994, for a nice exposition) and are therefore difficult to train. In our experience, the variance of the solution with respect to weight initialization can be quite large (see section 4), indicating that the networks are prone to trapping in poor local optima. We propose an alternative that does not suffer from these problems.

Our proposal is to construct local models, each pertaining to a different disjoint region of the data space. Within each region, the model complexity is limited; we construct linear models by PCA. If the local regions are small enough, the data manifold will not curve much over the extent of the region, and the linear model will be a good fit (low bias). Schematically, the training algorithm is as follows:

1. Partition the input space Rⁿ into Q disjoint regions {R⁽¹⁾, …, R⁽Q⁾}.

2. Compute the local covariance matrices
$$\Sigma^{(i)} = E\big[\,(x - Ex)(x - Ex)^T \mid x \in R^{(i)}\,\big]; \qquad i = 1, \ldots, Q,$$
and their eigenvectors $e_j^{(i)}$, j = 1, …, n. Relabel the eigenvectors so
that the corresponding eigenvalues are in descending order $\lambda_1^{(i)} > \lambda_2^{(i)} > \cdots > \lambda_n^{(i)}$.

3. Choose a target dimension m and retain the leading m eigendirections for the encoding.

The partition is formed by VQ on the training data. The distortion measure for the VQ strongly affects the partition, and therefore the reconstruction error for the algorithm. We discuss two alternative distortion measures. The local PCA defines local coordinate patches for the data, with the orientation of the local patches determined by the PCA within each region R_i. Figure 2 shows a set of local two-dimensional coordinate frames induced on the hemisphere data from figure 1, using the standard Euclidean distance as the VQ distortion measure.

Figure 2: Coordinates built by local PCA for data distributed on the surface of a hemisphere. The solid lines are the two principal eigendirections for the data in each region. One of the regions is shown shaded for emphasis.

3.1 Euclidean Partition. The simplest way to construct the partition is to build a VQ based on Euclidean distance. This can be accomplished by either an online competitive learning algorithm or by the generalized Lloyd algorithm (Gersho & Gray, 1992), which is the batch counterpart of competitive learning. In either case, the trained quantizer consists of a set of Q reference vectors r⁽ⁱ⁾, i = 1, …, Q, and corresponding regions R⁽ⁱ⁾. The placement of the reference vectors and the definition of the regions satisfy Lloyd's optimality conditions:

1. Each region R⁽ⁱ⁾ corresponds to all x ∈ Rⁿ that lie closer to r⁽ⁱ⁾ than to any other reference vector. Mathematically, $R^{(i)} = \{x \mid d_E(x, r^{(i)}) <$
$d_E(x, r^{(j)}),\ \forall j \neq i\}$, where $d_E(a, b)$ is the Euclidean distance between a and b. Thus, a given x is assigned to its nearest neighbor r.

2. Each reference vector r⁽ⁱ⁾ is placed at the centroid of the corresponding region R⁽ⁱ⁾. For Euclidean distance, the centroid is the mean $r^{(i)} = E[x \mid x \in R^{(i)}]$.

For Euclidean distance, the regions are connected, convex sets called Voronoi cells. As described in the introduction to section 3, one next computes the covariance matrices for the data in each region R⁽ⁱ⁾ and performs a local PCA projection. The m-dimensional encoding of the original vector x is thus given in two parts: the index of the Voronoi region that the vector lies in and the local coordinates of the point with respect to the centroid, in the basis of the m leading eigenvectors of the corresponding covariance.⁴ For example, if x ∈ R⁽ⁱ⁾, the local coordinates are
$$z = \big(e_1^{(i)} \cdot (x - r^{(i)}), \ldots, e_m^{(i)} \cdot (x - r^{(i)})\big). \tag{3.1}$$
The decoded vector is given by
$$\hat{x} = r^{(i)} + \sum_{j=1}^{m} z_j e_j^{(i)}. \tag{3.2}$$
The mean squared reconstruction error incurred is
$$\mathcal{E}_{\mathrm{recon}} = E\big[\|x - \hat{x}\|^2\big]. \tag{3.3}$$
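A minimal NumPy sketch of this encode/decode step, with our own helper names; `eigvecs[i]` is assumed to hold the eigenvectors of the ith local covariance as columns, sorted by descending eigenvalue:

```python
import numpy as np

def vqpca_encode(x, refs, eigvecs, m):
    """Euclidean-partition encoding: (region index, local coordinates),
    following equation 3.1. refs has shape (Q, n)."""
    i = int(np.argmin(np.sum((refs - x) ** 2, axis=1)))  # nearest r^(i)
    z = eigvecs[i][:, :m].T @ (x - refs[i])              # equation 3.1
    return i, z

def vqpca_decode(i, z, refs, eigvecs):
    """Reconstruction x_hat = r^(i) + sum_j z_j e_j^(i), equation 3.2."""
    m = len(z)
    return refs[i] + eigvecs[i][:, :m] @ z
```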
3.2 Projection Partition. The algorithm described above is not optimal because the partition is constructed independent of the projection that follows. To understand the proper distortion function from which to construct the partition, consider the reconstruction error for a vector x that lies in R⁽ⁱ⁾,
$$d(x, r^{(i)}) \equiv \Big\|\,x - r^{(i)} - \sum_{j=1}^{m} z_j e_j^{(i)}\,\Big\|^2 = (x - r^{(i)})^T P^{(i)T} P^{(i)} (x - r^{(i)}) \equiv (x - r^{(i)})^T \Pi^{(i)} (x - r^{(i)}), \tag{3.4}$$
where P⁽ⁱ⁾ is the (n − m) × n matrix whose rows are the trailing eigenvectors of the covariance matrix Σ⁽ⁱ⁾. The matrix Π⁽ⁱ⁾ is the projection orthogonal to the local m-dimensional PCA subspace.

⁴ The number of bits required to specify the region is small (between four and seven bits in all the experiments presented here) with respect to the number of bits used to express the double-precision coordinates within each region. In this respect, the specification of the region is nearly free.
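A minimal sketch of this distortion function of equation 3.4, under the same conventions as the encoder above (eigenvector columns sorted by descending eigenvalue, so the trailing n − m directions are columns m onward):

```python
import numpy as np

def reconstruction_distance(x, r, eigvecs_i, m):
    """Equation 3.4: squared distance from x to the local
    m-dimensional principal subspace at reference vector r."""
    P = eigvecs_i[:, m:].T        # (n - m) x n, rows = trailing e_j
    d = P @ (x - r)
    return float(d @ d)
```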
Figure 3: Assignment of the data point x to one of two regions based on (left) Euclidean distance and (right) the reconstruction distance. The reference vectors r⁽ⁱ⁾ and leading eigenvector $e_1^{(i)}$ are shown for each of two regions (i = 1, 2). See text for explanation.
The reconstruction distance d(x, r⁽ⁱ⁾) is the squared projection of the difference vector x − r⁽ⁱ⁾ on the trailing eigenvectors of the covariance matrix for region R⁽ⁱ⁾.⁵ Equivalently, it is the squared Euclidean distance to the linear manifold that is defined by the local m-dimensional PCA in the ith local region. Clustering with respect to the reconstruction distance directly minimizes the expected reconstruction error E_recon.

Figure 3 illustrates the difference between Euclidean distance and the reconstruction distance, with the latter intended for a one-dimensional local PCA. Suppose we want to determine to which of two regions the data point x belongs. For Euclidean clustering, the distance between the point x and the two centroids r⁽¹⁾ and r⁽²⁾ is compared, and the data point is assigned to the cluster whose centroid is closest—in this case, region 1. For clustering by the reconstruction distance, the distance from the point to the two one-dimensional subspaces (corresponding to the principal subspace for the two regions) is compared, and the data point is assigned to the region whose principal subspace is closest—in this case, region 2. Data points that lie on the intersection of hyperplanes are assigned to the region with lower index. Thus the membership in regions defined by the reconstruction distance can be different from that defined by Euclidean distance. This is because the reconstruction distance does not count the distance along the leading eigendirections. Neglecting the distance along the leading eigenvectors is exactly what is required, since we retain all the information in the leading directions during the PCA projection. Notice too that, unlike the Euclidean Voronoi regions, the regions arising from the reconstruction distance may not be connected sets.

⁵ Note that when the target dimension m equals 0, the representation is reduced to the reference vector r⁽ⁱ⁾ with no local coordinates. The distortion measure then reduces to the Euclidean distance.

Since the reconstruction distance (see equation 3.4) depends on the eigenvectors of Σ⁽ⁱ⁾, an online algorithm for clustering would be prohibitively
Since the reconstruction distance (see equation 3.4) depends on the eigenvectors of Σ^(i), an online algorithm for clustering would be prohibitively expensive. Instead, we use the generalized Lloyd algorithm to compute the partition iteratively. The algorithm is:

1. Initialize the r^(i) to randomly chosen inputs from the training data set. Initialize the Σ^(i) to the identity matrix.

2. Partition. Partition the training data into Q regions R^(1), . . . , R^(Q), where

   R^(i) = { x | d(x, r^(i)) ≤ d(x, r^(j)) for all j ≠ i },   (3.5)
with d(x, r^(i)) the reconstruction distance defined in equation 3.4.

3. Generalized centroid. According to the Lloyd algorithm, the reference vectors r^(i) are to be placed at the generalized centroid of the region R^(i). The generalized centroid is defined by

   r^(i) = argmin_r (1/N_i) Σ_{x∈R^(i)} (x − r)^T Π^(i) (x − r),   (3.6)

where N_i is the number of data points in R^(i). Expanding the projection operator Π^(i) in terms of the eigenvectors e_j^(i), j = m + 1, . . . , n, and setting to zero the derivative of the argument of the right-hand side of equation 3.6 with respect to r, one finds a set of equations for the generalized centroid⁶ (Kambhatla, 1995),

   Π^(i) r = Π^(i) x̄,   (3.7)

where x̄ is the mean of the data in R^(i). Thus any vector r whose projection along the trailing eigenvectors equals the projection of x̄ along the trailing eigenvectors is a generalized centroid of R^(i). For convenience, we take r = x̄. Next compute the covariance matrices

   Σ^(i) = (1/N_i) Σ_{x∈R^(i)} (x − r^(i))(x − r^(i))^T

and their eigenvectors e_j^(i).

4. Iterate steps 2 and 3 until the fractional change in the average reconstruction error is below some specified threshold.
6 In deriving the centroid equations, care must be exercised to take into account the dependence of e_j^(i) (and hence Π^(i)) on r^(i).
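A compact NumPy sketch of this loop follows; the function name, convergence test, and handling of empty regions are our own illustrative choices, not specified in the text.

```python
import numpy as np

def vqpca_fit(X, Q, m, tol=1e-4, seed=0):
    """Generalized Lloyd algorithm with the reconstruction distance (eqs. 3.4-3.7)."""
    rng = np.random.default_rng(seed)
    T, n = X.shape
    r = X[rng.choice(T, size=Q, replace=False)].copy()  # step 1: random inputs
    Sigma = np.tile(np.eye(n), (Q, 1, 1))               # step 1: identity covariances
    prev_err = np.inf
    while True:
        # Trailing n - m eigenvectors of each covariance (eigh sorts ascending).
        P = [np.linalg.eigh(S)[1][:, : n - m].T for S in Sigma]
        # Step 2: reconstruction distance of every point to every region (eq. 3.5).
        d = np.stack([np.sum(((X - r[i]) @ P[i].T) ** 2, axis=1) for i in range(Q)])
        labels = d.argmin(axis=0)                       # ties go to the lower index
        err = d[labels, np.arange(T)].mean()
        # Step 4: stop when the fractional error change drops below tol.
        if np.isfinite(prev_err) and (prev_err - err) <= tol * prev_err:
            break
        prev_err = err
        # Step 3: generalized centroid (region mean, eq. 3.7) and new covariances.
        for i in range(Q):
            pts = X[labels == i]
            if len(pts) == 0:
                continue                                # leave an empty region as is
            r[i] = pts.mean(axis=0)
            Sigma[i] = np.cov(pts.T, bias=True)
    return r, Sigma, labels
```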
Following training, vectors are encoded and decoded as follows. To encode a vector x, find the reference vector r^(i) that minimizes the reconstruction distance d(x, r^(i)), and project x − r^(i) onto the leading m eigenvectors of the corresponding covariance matrix Σ^(i) to obtain the local principal components

   z = ( e_1^(i) · (x − r^(i)), . . . , e_m^(i) · (x − r^(i)) ).   (3.8)

The encoding of x consists of the index i and the m local principal components z. The decoding, or reconstruction, of the vector x is

   x̂ = r^(i) + Σ_{j=1}^{m} z_j e_j^(i).   (3.9)
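As an illustration, encoding and decoding per equations 3.8 and 3.9 might look as follows in NumPy (a sketch; in practice one would cache the eigenvectors rather than recompute them on every call).

```python
import numpy as np

def vqpca_encode(x, r, Sigma, m):
    """Encode x as (region index i, m local principal components z); eq. 3.8."""
    n = x.shape[0]
    best = (None, np.inf, None)
    for i in range(len(r)):
        E = np.linalg.eigh(Sigma[i])[1]                  # eigenvectors, ascending
        d = np.sum((E[:, : n - m].T @ (x - r[i])) ** 2)  # reconstruction distance
        if d < best[1]:
            best = (i, d, E)
    i, _, E = best
    leading = E[:, :-(m + 1):-1]                 # leading m eigenvectors, descending
    z = leading.T @ (x - r[i])
    return i, z

def vqpca_decode(i, z, r, Sigma):
    """Reconstruct x_hat = r^(i) + sum_j z_j e_j^(i); eq. 3.9."""
    m = z.shape[0]
    E = np.linalg.eigh(Sigma[i])[1]
    leading = E[:, :-(m + 1):-1]
    return r[i] + leading @ z
```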
The clustering in the algorithm directly minimizes the expected reconstruction distance since it is a generalized Lloyd algorithm with the reconstruction distance as the distortion measure. Training in batch mode avoids recomputing the eigenvectors after each input vector is presented.

3.3 Accelerated Clustering. Vector quantization partitions data by calculating the distance between each data point x and all of the reference vectors r^(i). The search and storage requirements for computing the partition can be streamlined by constraining the structure of the VQ. These constraints can compromise performance relative to what could be achieved with a standard, or unconstrained, VQ with the same number of regions. However, the constrained architectures allow a given hardware configuration to support a quantizer with many more regions than practicable for the unconstrained architecture, and they can thus improve speed and accuracy relative to what one can achieve in practice using an unconstrained quantizer. The two most common structures are the tree-search VQ and the multistage VQ (Gersho & Gray, 1992).

Tree-search VQ was designed to alleviate the search bottleneck for encoding. At the root of the tree, a partition into b_0 regions is constructed. At the next level of the tree, each of these regions is further partitioned into b_1 regions (often b_0 = b_1 = · · ·), and so forth. After k levels, each with branching ratio b, there are b^k regions in the partition. Encoding new data requires at most kb distortion calculations, whereas the unconstrained quantizer with the same number of regions requires up to b^k distortion calculations. Thus the search complexity grows only logarithmically with the number of regions for the tree-search VQ, whereas the search complexity grows linearly with the number of regions for the unconstrained quantizer. However, the number of reference vectors in the tree is b(b^k − 1)/(b − 1), whereas the unconstrained quantizer requires only b^k reference vectors. Thus the tree typically requires more storage.
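The counts above are easy to check numerically; the following sketch (ours, not from the paper) reproduces them for the tree used in the speech experiments below.

```python
def tree_vs_flat(b, k):
    """Search and storage counts for a depth-k tree with branching ratio b."""
    regions = b ** k                              # regions in the final partition
    tree_search = k * b                           # at most kb distortion calcs
    flat_search = regions                         # up to b^k distortion calcs
    tree_storage = b * (b ** k - 1) // (b - 1)    # reference vectors in the tree
    flat_storage = regions
    return tree_search, flat_search, tree_storage, flat_storage

# b = 15, k = 2 (the VQPCA-E-T (15 x 15) tree of Table 1): search drops from
# 225 to 30 distortion calculations; storage rises from 225 to 240 vectors.
print(tree_vs_flat(15, 2))  # (30, 225, 240, 225)
```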
Multistage VQ is a form of product coding and as such is economical in both search and storage requirements. At the first stage, a partition into b_0 regions is constructed. Then all of the data are encoded using this partition, and the residuals ε = x − r^(i) from all b_0 regions are pooled. At the next stage, these residuals are quantized by a b_1-region quantizer, and so forth. The final k-stage quantizer (again assuming b regions at each level) has b^k effective regions. Encoding new data requires at most kb distortion calculations. Although the search complexity is the same as for the tree quantizer, the storage requirements are more favorable than for either the tree or the unconstrained quantizer. There are a total of b^k regions generated by the multistage architecture, requiring only kb reference vectors. The unconstrained quantizer would require b^k reference vectors to generate the same number of regions, and the tree requires more.

The drawback is that the shapes of the regions in a multistage quantizer are severely restricted. By sketching regions for a two-stage quantizer with two or three regions per stage, the reader can easily show that the final regions consist of two or three shapes that are copied and translated to tile the data space. In contrast, an unconstrained quantizer will construct as many different shapes as required to encode the data with minimal distortion.

4 Experimental Results

In this section we present the results of experiments comparing global PCA, five-layer autoassociators, and several variants of VQPCA applied to dimension reduction of speech and image data. We compare the algorithms' training time and the distortion in the reconstructed signal. The distortion measure we use is the reconstruction error normalized by the data variance,
   E_norm ≡ E[ ‖x − x̂‖² ] / E[ ‖x − E[x]‖² ],   (4.1)
where the expectations are with respect to the empirical data distribution.

We trained the autoassociators using three optimization techniques: conjugate gradient descent (CGD), stochastic gradient descent (SGD), and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method (Press, Flannery, Teukolsky, & Vetterling, 1987). The gradient descent code has a momentum term and learning rate annealing as in Darken and Moody (1991). We give results for VQPCA using both Euclidean and reconstruction distance, with unconstrained, multistage, and tree-search VQ. For the Euclidean distortion measure with an unconstrained quantizer, we used online competitive learning with the annealed learning rate schedule in Darken and Moody (1991).
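Equation 4.1 amounts to a two-line computation over the empirical data; a NumPy sketch (names ours):

```python
import numpy as np

def normalized_error(X, X_hat):
    """E_norm of equation 4.1, with expectations taken over the rows of X."""
    num = np.mean(np.sum((X - X_hat) ** 2, axis=1))           # E[||x - x_hat||^2]
    den = np.mean(np.sum((X - X.mean(axis=0)) ** 2, axis=1))  # E[||x - E[x]||^2]
    return num / den
```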
We computed the global PCA for the speech data by diagonalizing the covariance matrices using Householder reduction followed by QL.⁷ The image training set has fewer data points than the input dimension. Thus the covariance matrix is singular, and we cannot use Householder reduction. Instead we computed the PCA using singular-value decomposition applied to the data matrix.

All of the architecture selection was carried out by monitoring performance on a validation (or holdout) set. To limit the space of architectures, the autoassociators have an equal number of nodes in the second and fourth (sigmoidal) layers. These were varied from 5 to 50 in increments of 5. The networks were regularized by early stopping on a validation data set. For VQPCA, we varied the number of local regions for the unconstrained VQ from 5 to 50 in increments of 5. The multistage VQ examples used two levels, with the number of regions at each level varied from 5 to 50 in increments of 5. The branching factor of the tree-structured VQ was varied from 2 to 9. For all experiments and all architectures, we report only the results on those architectures that obtained the lowest reconstruction error on the validation data set. In this sense, we report the best results we obtained.

4.1 Dimension Reduction of Speech. We drew examples of the 12 monophthongal vowels extracted from continuous speech in the TIMIT database (Fisher & Doddington, 1986). Each input vector consists of the lowest 32 discrete Fourier transform coefficients (spanning the frequency range 0-4 kHz), time-averaged over the central third of the vowel segment. The training set contains 1200 vectors, the validation set 408 vectors, and the test set 408 vectors. The test set utterances are from speakers not represented in the training or validation sets. Motivated by the desire to capture formant structure in the vowel encodings, we reduced the data from 32 to 2 dimensions. The results of our experiments are shown in Table 1.

Table 1 shows the mean and 2-σ error bars (computed over four random initializations) of the test set reconstruction error, and the Sparc 2 training times (in seconds). The numbers in parentheses are the values of architectural parameters. For example, Autoassoc. (35) is a five-layer autoassociator with 35 nodes in each of the second and fourth layers, and VQPCA-E-MS (40 × 40) is a Euclidean-distortion, multistage VQPCA with 40 cells in each of two stages.

The VQPCA encodings have about half the reconstruction error of the global PCA or the five-layer networks. The latter failed to obtain significantly better results than the global PCA. The autoassociators show high variance in the reconstruction errors with respect to different random weight initializations. Several nets had higher error than PCA, indicating trapping
7 The Householder algorithm reduces the covariance matrix to tridiagonal form, which is then diagonalized by the QL procedure (Press et al., 1987).
Table 1: Speech Dimension Reduction.

Algorithm               Enorm           Training Time (seconds)
PCA                     0.443           11
Autoassoc.-CGD (35)     0.496 ± .103    7784 ± 7442
Autoassoc.-BFGS (20)    0.439 ± .059    3284 ± 1206
Autoassoc.-SGD (35)     0.440 ± .016    35,502 ± 182
VQPCA-Eucl (50)         0.272 ± .010    1915 ± 780
VQPCA-E-MS (40 × 40)    0.244 ± .008    144 ± 41
VQPCA-E-T (15 × 15)     0.259 ± .002    195 ± 10
VQPCA-Recon (45)        0.230 ± .004    864 ± 102
VQPCA-R-MS (45 × 45)    0.208 ± .005    924 ± 50
VQPCA-R-T (9 × 9)       0.242 ± .005    484 ± 128
in particularly poor local optima. As expected, stochastic gradient descent showed less variance due to initialization. In contrast, the local linear algorithms are relatively insensitive to initialization. Clustering with reconstruction distance produced somewhat lower error than clustering with Euclidean distance. For both distortion measures, the multistage architecture produced the lowest error.

The five-layer autoassociators are very slow to train. The Euclidean multistage and tree-structured local linear algorithms trained more than an order of magnitude faster than the autoassociators. For the unconstrained, Euclidean-distortion VQPCA, the partition was determined by online competitive learning and could probably be speeded up a bit by using a batch algorithm.

In addition to training time, practicality depends on the time required to encode and decode new data. Table 2 shows the number of floating-point operations required to encode and decode the entire database for the different algorithms. The VQPCA algorithms, particularly those using the reconstruction distance, require many more floating-point operations to encode the data than the autoassociators. However, the decoding is much faster than that for autoassociators and is comparable to PCA.⁸ The results indicate that VQPCA may not be suitable for real-time applications like videoconferencing, where very fast encoding is desired. However, when only the decoding speed is of concern (e.g., for data retrieval), VQPCA algorithms are a good choice because of their accuracy and fast decoding.

8 However, parallel implementation of the autoassociators could outperform the VQPCA decode in terms of clock time.
Table 2: Encoding and Decoding Times for the Speech Data.

Algorithm               Encode Time (FLOPs)    Decode Time (FLOPs)
PCA                     158                    128
Autoassoc.-CGD (35)     2380                   2380
Autoassoc.-BFGS (20)    1360                   1360
Autoassoc.-SGD (35)     2380                   2380
VQPCA-Eucl (50)         4957                   128
VQPCA-E-MS (40 × 40)    7836                   192
VQPCA-E-T (15 × 15)     3036                   128
VQPCA-Recon (45)        87,939                 128
VQPCA-R-MS (45 × 45)    96,578                 192
VQPCA-R-T (9 × 9)       35,320                 128
In order to test the variability of these results across different training and test sets, we reshuffled and repartitioned the data into new training, validation, and test sets of the same size as those above. The new data sets gave results very close to those reported here (Kambhatla, 1995).

4.2 Dimension Reduction of Images. Our image database consists of 160 images of the faces of 20 people. Each image is a 64 × 64, 8-bit/pixel grayscale image. The database was originally generated by Cottrell and Metcalfe and used in their study of identity, gender, and emotion recognition (Cottrell & Metcalfe, 1991). We adopted the database, and the preparation used here, in order to compare our dimension reduction algorithms with the nonlinear autoassociators used by DeMers and Cottrell (1993). As in DeMers and Cottrell (1993), each image is used complete, as a 4096-dimensional vector, and is preprocessed by extracting the leading 50 principal components computed from the ensemble of 160 images. Thus the base dimension is 50. As in DeMers and Cottrell (1993), we examined reduction to five dimensions.

We divided the data into a training set containing 120 images, a validation set of 20 images, and a test set of 20 images. We used PCA, five-layer autoassociators, and VQPCA for reduction to five dimensions. Due to memory constraints, the autoassociators were limited to 5 to 40 nodes in each of the second and fourth layers. The autoassociators and the VQPCA were trained with four different random initializations of the free parameters.

The experimental results are shown in Table 3. The five-layer networks attain about 30 percent lower error than global PCA. VQPCA with either Euclidean or reconstruction distance distortion measures attains about 40 percent lower error than the best autoassociator.
Table 3: Image Dimension Reduction.

Algorithm               Enorm           Training Time (seconds)
PCA                     0.463           5
Autoassoc.-CGD (35)     0.441 ± .090    698 ± 533
Autoassoc.-BFGS (35)    0.377 ± .127    18,905 ± 15,081
Autoassoc.-SGD (25)     0.327 ± .027    4171 ± 41
VQPCA-Eucl (20)         0.179 ± .048    202 ± 57
VQPCA-E-MS (5 × 5)      0.307 ± .031    14 ± 2
VQPCA-E-Tree (4 × 4)    0.211 ± .064    31 ± 9
VQPCA-Recon (20)        0.173 ± .050    62 ± 5
VQPCA-R-MS (20 × 20)    0.240 ± .042    78 ± 32
VQPCA-R-Tree (5 × 5)    0.218 ± .029    79 ± 15
Table 4: Encoding and Decoding Times (FLOPs) for the Image Data.

Algorithm               Encode Time (FLOPs)    Decode Time (FLOPs)
PCA                     545                    500
Autoassoc.-CGD (35)     3850                   3850
Autoassoc.-BFGS (35)    3850                   3850
Autoassoc.-SGD (25)     2750                   2750
VQPCA-Eucl (20)         3544                   500
VQPCA-E-MS (5 × 5)      2043                   600
VQPCA-E-T (4 × 4)       1743                   500
VQPCA-Recon (20)        91,494                 500
VQPCA-R-MS (20 × 20)    97,493                 600
VQPCA-R-T (5 × 5)       46,093                 500
There is little distinction between the Euclidean and reconstruction distance clustering for these data. The VQPCA trains significantly faster than the autoassociators. Although the conjugate gradient algorithm is relatively quick, it generates encodings inferior to those obtained with the stochastic gradient descent and BFGS simulators.

Table 4 shows the encode and decode times for the different algorithms. We again note that VQPCA algorithms using reconstruction distance clustering require many more floating-point operations (FLOPs) to encode an input vector than does the Euclidean distance algorithm or the five-layer networks.
Table 5: Image Dimension Reduction: Training on All Available Data.

Algorithm               Enorm     Training Time (seconds)
PCA                     0.405     7
Autoassoc.-SGD (30)     0.103     25,306
Autoassoc.-SGD (40)     0.073     31,980
VQPCA-Eucl (25)         0.026     251
VQPCA-Recon (30)        0.022     116
However, as before, the decode times are much less for VQPCA. As before, shuffling and repartitioning the data into training, validation, and test data sets and repeating the experiments returned results very close to those given here.

Finally, in order to compare directly with DeMers and Cottrell's (1993) results, we also conducted experiments training with all the data (no separation into validation and test sets). This is essentially a model-fitting problem, with no influence from statistical sampling. We show results only for the autoassociators trained with SGD, since these returned lower error than the conjugate gradient simulators, and the memory requirements for BFGS were prohibitive. We report the results from those architectures that provided the lowest reconstruction error on the training data. The results are shown in Table 5.

Both nonlinear techniques produce encodings with lower error than PCA, indicating significant nonlinear structure in the data. For the same data and using a five-layer autoassociator with 30 nodes in each of the second and fourth layers, DeMers and Cottrell (1993) obtain a reconstruction error E_norm = 0.1317.⁹ This is comparable to our results. We note that the VQPCA algorithms train two orders of magnitude faster than the networks while obtaining encodings with about one-third the reconstruction error.

It is useful to examine the images obtained from the encodings for the various algorithms. Figure 4 shows two sample images from the data set along with their reconstructions from five-dimensional encodings. The algorithms correspond to those reported in Table 5. These two images were selected because their reconstruction error closely matched the average. The left-most column shows the images as reconstructed from the 50 principal components. The second column shows the reconstruction from 5 principal components.

9 DeMers and Cottrell report half the MSE per output node, E = (1/2)(1/50) MSE = 0.001. This corresponds to E_norm = 0.1317.
Figure 4: Two representative images. Left to right: original image reconstructed from 50 principal components, followed by reconstructions from 5-D encodings by PCA, Autoassoc-SGD (40), VQPCA-Eucl (25), and VQPCA-Recon (30). The normalized reconstruction errors and training times for the whole data set (all the images) are given in Table 5.
The third column is the reconstruction from the five-layer autoassociator, and the last two columns are the reconstructions from the Euclidean and reconstruction distance VQPCA. The five-dimensional PCA has grossly reduced resolution and gray-scale distortion (e.g., the hair in the top image). All of the nonlinear algorithms produce superior results, as indicated by the reconstruction error. The lower image shows a subtle difference between the autoassociator and the two VQPCA reconstructions; the posture of the mouth is correctly recovered in the latter.

5 Discussion

We have applied locally linear models to the task of dimension reduction, finding superior results to both PCA and the global nonlinear model built by five-layer autoassociators. The local linear models train significantly faster than autoassociators, with their training time dominated by the partitioning step. Once the model is trained, encoding new data requires computing which region of the partition the new data fall in, and thus VQPCA requires more computation for encoding than does the autoassociator. However, decoding data with VQPCA is faster than decoding with the autoassociator and comparable to decoding with PCA. With these considerations, VQPCA is perhaps not optimal for real-time encoding, but its accuracy and computational speed for decoding make it superior for applications like image retrieval from databases.
In order to optimize the partition with respect to the accuracy of the projected data, we introduced a new distortion function for the VQ, the reconstruction distance. Clustering with the reconstruction distance does indeed provide lower reconstruction error, though the difference is data dependent.

We cannot offer a definitive reason for the superiority of the VQPCA representations over five-layer autoassociators. In particular, we are uncertain how much of the failure of autoassociators is due to optimization problems and how much to representation capability. We have observed that changes in initialization result in large variability in the reconstruction error of solutions arrived at by autoassociators. We also see strong dependence on the optimization technique. This lends support to the notion that some of the failure is a training problem. It may be that the space of architectures we have explored does have better solutions but that the available optimizers fail to find them.

The optimization problem is a very real block to the effective use of five-layer autoassociators. Although the architecture is an elegant construction for nonlinear dimension reduction, realizing consistently good maps with standard optimization techniques has proved elusive on the examples we considered here.

Optimization may not be the only issue. Autoassociators constructed from neurons with smooth activation functions are constrained to generate smooth projections and immersions. The VQPCA algorithms have no inherent smoothness constraint. This will be an advantage when the data are not well described by a manifold.¹⁰ Moreover, Malthouse (1996) gives a simple example of data for which continuous projections are necessarily suboptimal.

In fact, if the data are not well described by a manifold, it may be advantageous to choose the representation dimension locally, allowing a different dimension for each region of the partition. The target dimension could be chosen using a holdout data set, or more economically by a cost function that includes a penalty for model complexity. Minimum description length (MDL) criteria have been used for PCA (Hediger et al., 1990; Wax & Kailath, 1985) and presumably could be applied to VQPCA. We have explored no such methods for estimating the appropriate target dimension. In contrast, the autoassociator algorithm given by DeMers and Cottrell (1993) includes a pruning strategy to progressively reduce the dimensionality of the representation layer, under the constraint that the reconstruction error not grow above a desired threshold.

Two applications deserve attention. The first is transform coding, for which the algorithms discussed here are naturally suited. In transform coding, one transforms the data, such as image blocks, to a lower-redundancy
10 A visualization of just such a case is given in Kambhatla (1995).
representation and then scalar quantizes the new representation. This produces a product code for the data. Standard approaches include preprocessing by PCA or discrete cosine transform, followed by scalar quantization (Wallace, 1991). As discussed in the introduction, the nonlinear transforms considered here provide more accurate representations than PCA and should provide for better transform coding. This work suggests a full implementation of transform coding, with comparisons between PCA, autoassociators, and VQPCA in terms of rate-distortion curves.

Transform coding using VQPCA with the reconstruction distance clustering requires additional algorithm development. The reconstruction distance distortion function depends explicitly on the target dimension, while the latter depends on the allocation of transform bits between the new coordinates. Consequently a proper transform coding scheme needs to couple the bit allocation to the clustering, an enhancement that we are developing.

The second potential application is in novelty detection. Recently several authors have used three-layer autoassociators to build models of normal equipment function (Petsche et al., 1996; Japkowicz et al., 1995). Equipment faults are then signaled by the failure of the model to reconstruct the new signal accurately. The nonlinear models provided by VQPCA should provide more accurate models of the normal data and hence improve the sensitivity and specificity for fault detection.

Acknowledgments

This work was supported in part by grants from the Air Force Office of Scientific Research (F49620-93-1-0253) and the Electric Power Research Institute (RP8015-2). We thank Gary Cottrell and David DeMers for supplying image data and the Center for Spoken Language Understanding at the Oregon Graduate Institute of Science and Technology for speech data. We are grateful for the reviewer's careful reading and helpful comments.

References

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53-58.

Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cyb., 59, 291-294.

Bregler, C., & Omohundro, S. M. (1995). Nonlinear image interpolation using manifold learning. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems 7. Cambridge, MA: MIT Press.

Broomhead, D. S. (1991, July). Signal processing for nonlinear systems. In S. Haykin (Ed.), Adaptive Signal Processing, SPIE Proceedings (pp. 228-243). Bellingham, WA: SPIE.
Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. J. Amer. Stat. Assoc., 83, 596-610.

Coetzee, F. M., & Stonick, V. L. (1995). Topology and geometry of single hidden layer network, least squares weight solutions. Neural Computation, 7, 672-705.

Cottrell, G. W., & Metcalfe, J. (1991). EMPATH: Face, emotion, and gender recognition using holons. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 564-571). San Mateo, CA: Morgan Kaufmann.

Cottrell, G. W., Munro, P., & Zipser, D. (1987). Learning internal representations from gray-scale images: An example of extensional programming. In Proceedings of the Ninth Annual Cognitive Science Society Conference (pp. 461-473). Seattle, WA.

Darken, C., & Moody, J. (1991). Note on learning rate schedules for stochastic optimization. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3. San Mateo, CA: Morgan Kaufmann.

DeMers, D., & Cottrell, G. (1993). Non-linear dimensionality reduction. In C. Giles, S. Hanson, & J. Cowan (Eds.), Advances in neural information processing systems 5. San Mateo, CA: Morgan Kaufmann.

Dony, R. D., & Haykin, S. (1995). Optimally adaptive transform coding. IEEE Transactions on Image Processing (pp. 1358-1370).

Fisher, W. M., & Doddington, G. R. (1986). The DARPA speech recognition research database: Specification and status. In Proceedings of the DARPA Speech Recognition Workshop (pp. 93-99). Palo Alto, CA.

Fukunaga, K., & Olsen, D. R. (1971). An algorithm for finding intrinsic dimensionality of data. IEEE Transactions on Computers, C-20, 176-183.

Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183-192.

Gersho, A., & Gray, R. M. (1992). Vector quantization and signal compression. Boston: Kluwer.

Golomb, B. A., Lawrence, D. T., & Sejnowski, T. J. (1991). Sexnet: A neural network identifies sex from human faces. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 572-577). San Mateo, CA: Morgan Kaufmann.

Gray, R. M. (1984, April). Vector quantization. IEEE ASSP Magazine, pp. 4-29.

Hastie, T. (1984). Principal curves and surfaces. Unpublished dissertation, Stanford University.

Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84, 502-516.

Hecht-Nielsen, R. (1995). Replicator neural networks for universal optimal source coding. Science, 269, 1860-1863.

Hediger, T., Passamante, A., & Farrell, M. E. (1990). Characterizing attractors using local intrinsic dimensions calculated by singular-value decomposition and information-theoretic criteria. Physical Review, A41, 5325-5332.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems 7. Cambridge, MA: MIT Press.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-368.

Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. In Proceedings of IJCAI.

Kambhatla, N. (1995). Local models and gaussian mixture models for statistical data processing. Unpublished doctoral dissertation, Oregon Graduate Institute.

Kambhatla, N., & Leen, T. K. (1994). Fast non-linear dimension reduction. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6. San Mateo, CA: Morgan Kaufmann.

Kramer, M. A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37, 233-243.

Kung, S. Y., & Diamantaras, K. I. (1990). A neural network learning algorithm for adaptive principal component extraction (APEX). In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 861-864).

Malthouse, E. C. (1996). Some theoretical results on non-linear principal components analysis (Unpublished research report). Evanston, IL: Kellogg School of Management, Northwestern University.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biology, 15, 267-273.

Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61-68.

Oja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural networks. In Artificial neural networks (pp. 737-745). Amsterdam: Elsevier Science Publishers.

Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. (1996). A neural network autoassociator for induction motor failure prediction. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press.

Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1987). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.

Rognvaldsson, T. (1994). On Langevin updating in multilayer perceptrons. Neural Computation, 6, 916-926.

Rubner, J., & Tavan, P. (1989). A self-organizing network for principal component analysis. Europhysics Lett., 20, 693-698.

Sanger, T. (1989). An optimality principle for unsupervised learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1. San Mateo, CA: Morgan Kaufmann.

Usui, S., Nakauchi, S., & Nakano, M. (1991). Internal color representation acquired by a five-layer neural network. In O. Simula, T. Kohonen, K. Makisara, & J. Kangas (Eds.), Artificial neural networks. Amsterdam: Elsevier Science Publishers, North-Holland.

Wallace, G. K. (1991). The JPEG still picture compression standard. Communications of the ACM, 34, 31-44.

Wax, M., & Kailath, T. (1985). Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-33(2), 387-392.
Yang, L., & Yu, W. (1993). Backpropagation with homotopy. Neural Computation, 5, 363–366.
Received September 6, 1996; accepted February 28, 1997.
Communicated by Robert Jacobs
A Constructive, Incremental-Learning Network for Mixture Modeling and Classification

James R. Williamson
Center for Adaptive Systems and Department of Cognitive and Neural Systems, Boston University, Boston, MA 02215, U.S.A.
Gaussian ARTMAP (GAM) is a supervised-learning adaptive resonance theory (ART) network that uses gaussian-defined receptive fields. Like other ART networks, GAM incrementally learns and constructs a representation of sufficient complexity to solve a problem it is trained on. GAM's representation is a gaussian mixture model of the input space, with learned mappings from the mixture components to output classes. We show a close relationship between GAM and the well-known expectation-maximization (EM) approach to mixture modeling. GAM outperforms an EM classification algorithm on three classification benchmarks, thereby demonstrating the advantage of the ART match criterion for regulating learning and the ARTMAP match tracking operation for incorporating environmental feedback in supervised learning situations.

1 Introduction

Adaptive resonance theory (ART) networks construct stable recognition categories for unsupervised clustering using fast, incremental learning (Carpenter & Grossberg, 1987). The size of clusters coded by ART categories is determined by a global match criterion. ART networks have been extended into supervised-learning ARTMAP networks, which use predictive feedback to regulate the ART clustering mechanism in order to learn multidimensional input-output mappings (Carpenter, Grossberg, & Reynolds, 1991). If an ARTMAP network's prediction is incorrect, then a match tracking signal from the network's output layer raises the match criterion and thus alters clustering in the ART module. In this way, ARTMAP networks realize perhaps the minimal possible extension to ART networks to enable supervised learning while preserving the ART design constraint of fast, incremental learning using only local update rules. In contrast, many online supervised-learning networks, such as multilayer perceptrons and adaptive radial basis function networks, are less local in nature because their gradient-descent learning algorithms require that error signals computed at each of the parameters in the output layer be fed back to each of the parameters in the hidden layer (Rumelhart, Hinton, & Williams, 1986; Poggio & Girosi, 1989).

Neural Computation 9, 1517-1543 (1997)
© 1997 Massachusetts Institute of Technology
A new ARTMAP network called gaussian ARTMAP (GAM), which uses internal recognition categories that have gaussian-defined receptive fields, has recently been introduced and applied to several classification problems (Williamson, 1996a, 1996b; Grossberg & Williamson, 1997). GAM's recognition categories learn a gaussian mixture model of the input space as well as mappings to the output classes. When GAM makes an incorrect prediction, its match tracking operation is triggered. The network's vigilance level is raised by adjusting the match criterion, which restricts activation to only those categories that have a sufficiently good match to the input. Match tracking continues until a correct prediction is made, after which the network learns. Thus, match tracking dynamically regulates learning based on predictive feedback. In addition, if no committed categories satisfy the match criterion, a new, uncommitted category is chosen. By this process GAM incrementally constructs a representation of sufficient complexity to solve a classification problem.

GAM is closely related to the expectation-maximization (EM) approach to mixture modeling (Dempster, Laird, & Rubin, 1977). We show that the EM algorithm for unsupervised density estimation using (separable) gaussian mixtures is essentially the same as the GAM equations for modeling the density of its input space, except that GAM is set apart by three features that are standard for ART networks:

1. GAM uses incremental learning, in which the parameters are updated after each input sample, whereas EM uses batch learning, which requires the entire data set. Incremental variants of the EM algorithm will also be discussed.

2. GAM restricts learning of the current data sample to the subset of categories that satisfy its match criterion, whereas EM allows all mixture components to be affected by all data samples.

3. GAM is a constructive network that chooses new, uncommitted categories during training when no committed categories satisfy its match criterion, whereas EM uses a constant, preset number of components.

A straightforward extension of the unsupervised EM mixture modeling algorithm to supervised classification problems involves modeling the class label as a multinomial variable. In this way, the mixture components represent the I → O mapping from a real-valued input space to a discrete-valued output space by modeling joint gaussian and multinomial densities in the I/O space (Ghahramani & Jordan, 1994). We show a close relationship between this EM classification algorithm and GAM. However, GAM is set apart by match tracking, which causes GAM to "pay attention" to its training errors and devote more resources to troublesome regions of its I/O space. GAM thereby learns a more effective representation of the I → O mapping than EM, as demonstrated by GAM's superior performance to EM on three classification benchmarks.
2 Gaussian ARTMAP

2.1 Category Match and Activation. GAM consists of an input layer, F1, and an internal category layer, F2, which receives input from F1 via adaptive weights. Activations at F1 and F2 are denoted, respectively, by x = (x_1, . . . , x_M) and y = (y_1, . . . , y_N), where M is the dimensionality of the input space and N is the current number of committed F2 category nodes. Each F2 category, j, models a local density of the input space with a separable gaussian receptive field and maps to an output class prediction. The category's receptive field is defined with a separable gaussian distribution parameterized by two M-dimensional vectors: its mean, μ_j, and standard deviation, σ_j. A scalar, n_j, also represents the amount of training data for which the node has received credit.

Category j is activated only if its match, G_j, satisfies the match criterion, which is determined by a vigilance parameter, ρ. Match is a measure, obtained from the category's unit-height gaussian distribution, of how close an input is to the category's mean, relative to its standard deviation:

   G_j = exp( −(1/2) Σ_{i=1}^{M} ((x_i − μ_ji)/σ_ji)² ).   (2.1)

The match criterion is a threshold: the category is activated only if G_j > ρ; otherwise, the category is reset. If the match criterion is satisfied, the category's net input signal, g_j, is determined by modulating its match value by n_j, which is proportional to the category's a priori probability, and by (Π_{i=1}^{M} σ_ji)^{−1}, which normalizes its gaussian distribution:

   g_j = ( n_j / Π_{i=1}^{M} σ_ji ) G_j  if G_j > ρ;   g_j = 0 otherwise.   (2.2)

The category's activation, y_j, represents its conditional probability for being the "source" of the input vector: P(j | x). This is obtained by normalizing the category's input strength:

   y_j = g_j / Σ_{l=1}^{N} g_l.   (2.3)
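A NumPy sketch of equations 2.1 through 2.3, vectorized over the N committed categories (the function and array names are ours, not from the paper):

```python
import numpy as np

def gam_activation(x, mu, sigma, n, rho):
    """Match G_j (eq. 2.1), net input g_j (eq. 2.2), activation y_j (eq. 2.3).

    mu, sigma : (N, M) category means and standard deviations;
    n : (N,) credit counts; rho : current vigilance.
    """
    G = np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2, axis=1))    # eq. 2.1
    g = np.where(G > rho, (n / np.prod(sigma, axis=1)) * G, 0.0)  # eq. 2.2
    total = g.sum()
    y = g / total if total > 0 else g   # eq. 2.3; all-zero if every category reset
    return G, g, y
```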
As originally proposed, GAM used a choice activation rule at F2: y_j = 1 if g_j > g_l for all l ≠ j; y_j = 0 otherwise (Williamson, 1996a). In this version, only a single, chosen category learned on each trial. Here, we describe a distributed-learning version of GAM, which uses the distributed F2 activation equation 2.3. Distributed GAM was introduced in Williamson (1996b), where it was shown to obtain a more efficient representation than GAM with choice learning. Distributed GAM has also been applied as part of an image classification system, where it outperformed existing state-of-the-art
image classification systems that use rule-based, multilayer perceptron, and k-nearest neighbor classifiers (Grossberg & Williamson, 1997).

2.2 Prediction and Match Tracking. Equations 2.1 through 2.3 describe activation of category nodes in an unsupervised-learning gaussian ART module. The following equations describe GAM's supervised-learning mechanism, which incorporates feedback from class predictions made by the F2 category nodes and thus turns gaussian ART into gaussian ARTMAP.

When a category, j, is first chosen, it learns a permanent mapping to the output class, k, associated with the current training sample. All categories that map to the same class prediction belong to the same ensemble: j ∈ E(k). Each time an input is presented, the categories in each ensemble sum their activations to generate a net probability estimate, z_k, of the class prediction k that they share:

   z_k = Σ_{j∈E(k)} y_j.   (2.4)
The system prediction, K, is obtained from the maximum probability estimate,

   K = arg max_k (z_k),   (2.5)
which also determines the chosen ensemble. On real-world problems, the probability estimate z_K has been found to predict accurately the probability that prediction K is correct (Grossberg & Williamson, 1997).

Note that category j's initial activation, y_j, represents P(j | x). Once the class prediction K is chosen, we obtain the category's "chosen-ensemble" activation, y*_j, which represents P(j | x, K):

   y*_j = y_j / Σ_{l∈E(K)} y_l  if j ∈ E(K);   y*_j = 0 otherwise.   (2.6)
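Continuing the sketch above, the prediction and chosen-ensemble activations of equations 2.4 through 2.6 might be computed as follows; the integer `ensemble` array encoding j ∈ E(k) is our own representation.

```python
import numpy as np

def gam_predict(y, ensemble, n_classes):
    """Class estimates z_k (eq. 2.4), prediction K (eq. 2.5), y*_j (eq. 2.6).

    ensemble : (N,) int array, ensemble[j] = class that category j maps to.
    """
    z = np.bincount(ensemble, weights=y, minlength=n_classes)  # eq. 2.4
    K = int(np.argmax(z))                                      # eq. 2.5
    y_star = np.where(ensemble == K, y, 0.0)                   # eq. 2.6
    if y_star.sum() > 0:
        y_star = y_star / y_star.sum()
    return z, K, y_star
```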
If K is the correct prediction, then the network resonates and learns the current input. If K is incorrect, then match tracking is invoked. As originally conceived, match tracking involves raising ρ continuously, causing categories j, such that G_j ≤ ρ, to be reset until the correct prediction is finally selected (Carpenter, Grossberg, & Rosen, 1991). Because GAM uses a distributed representation at F2, each z_k may be determined by multiple categories, according to equation 2.6. Therefore, it is difficult to determine numerically how much ρ needs to be raised in order to select a different prediction. It is inefficient (on a conventional computer) to determine the exact amount to raise ρ by repeatedly resetting the active category with the lowest match value G_j, each time reevaluating equations 2.3, 2.4, and 2.5, until a new prediction is finally selected.
Instead, a one-shot match tracking algorithm has been developed for GAM and used successfully on several classification problems (Williamson, 1996b; Grossberg & Williamson, 1997). This algorithm involves raising ρ to the average match value of the chosen ensemble:

   ρ = Σ_{j∈E(K)} y*_j exp( −(1/2) Σ_{i=1}^{M} ((x_i − μ_ji)/σ_ji)² ).   (2.7)
In addition, all categories in the chosen ensemble are reset: g_j = 0 for all j ∈ E(K). Equations 2.2 through 2.5 are then reevaluated. Based on the remaining nonreset categories, a new prediction K in equation 2.5, and its corresponding ensemble, is chosen. This automatic search cycle continues until the correct prediction is made or until all committed categories are reset, G_j ≤ ρ for all j ∈ {1, . . . , N}, and an uncommitted category is chosen. Match tracking ensures that the correct prediction comes from an ensemble with a better match to the training sample than all reset ensembles. Upon presentation of the next training sample, ρ is reassigned its baseline value: ρ = ρ̄.

2.3 Learning. The F2 parameters μ_j and σ_j are updated to represent the sample statistics of the input using local learning rules that are related to the instar, or gated steepest descent, learning rule (Grossberg, 1976). Instar learning is an associative rule in which the postsynaptic activity y*_j modulates the rate at which the weight w_ji tracks the presynaptic signal f(x_i),

   ε (d/dt) w_ji = y*_j [ f(x_i) − w_ji ].   (2.8)

The discrete-time version of equation 2.8 is

   w_ji := (1 − ε^{−1} y*_j) w_ji + ε^{−1} y*_j f(x_i).   (2.9)

GAM's learning equations are obtained by modifying equation 2.9. The rate constant ε is replaced by n_j, which is incremented to represent the cumulative chosen-ensemble activation of node j and thus the amount of training data the node has been assigned credit for:

   n_j := n_j + y*_j.   (2.10)

Modulation of learning by n_j causes the inputs to be weighted equally over time, so that their sample statistics are learned. The presynaptic term f(x_i) is set to x_i and x_i², respectively, for learning the first and second moments of the input. The standard deviation is then derived from these statistics:

   μ_ji := (1 − y*_j n_j^{−1}) μ_ji + y*_j n_j^{−1} x_i,   (2.11)

   ν_ji := (1 − y*_j n_j^{−1}) ν_ji + y*_j n_j^{−1} x_i²,   (2.12)

   σ_ji = √( ν_ji − μ_ji² ).   (2.13)

In Williamson (1996a, 1996b) and Grossberg and Williamson (1997), σ_ji, rather than ν_ji, is incrementally updated via

   σ_ji := √( (1 − y_j n_j^{−1}) σ_ji² + y_j n_j^{−1} (x_i − μ_ji)² ).   (2.14)
Unlike equations 2.12 and 2.13, equation 2.14 biases the estimate of σ_ji because the incremental updates are based on current estimates of μ_ji, which vary over time. This bias appears to be insignificant, however, as our simulations have not revealed a significant advantage for either method. Equations 2.12 and 2.13 are used here solely because they describe a simpler learning rule.

GAM is initialized with N = 0. When an uncommitted category is chosen, N is incremented, and the new category, indexed by N, is initialized with y*_N = 1 and n_N = 0, and with a permanent mapping to the correct output class. Learning then proceeds via equations 2.10 through 2.13, with one modification: a constant, γ², is added to ν_Ni in equation 2.12, which yields σ_Ni = γ in equation 2.13. Initializing categories with this nonzero standard deviation is necessary to make equations 2.1 and 2.2 well defined. Varying γ has a marked effect on learning: as γ is raised, learning becomes slower, but fewer categories are created. Generally, γ is much larger than the final standard deviation that a category converges to. Intuitively, a large γ represents a low level of certainty for, and commitment to, the location in the input space coded by a new category. As γ is raised, the network settles into its input space representation in a slower and more graceful way. Note that best results have generally been obtained by preprocessing the set of input vectors to have the same standard deviation in each dimension, so that γ has the same meaning in all the dimensions.
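A sketch of the incremental updates in equations 2.10 through 2.13; the division guard is ours, and the γ² term for a newly committed category is noted in a comment.

```python
import numpy as np

def gam_learn(x, y_star, mu, sigma, nu, n):
    """Credit-weighted running statistics; eqs. 2.10-2.13, applied in place."""
    n += y_star                                          # eq. 2.10
    rate = np.divide(y_star, n, out=np.zeros_like(n), where=n > 0)[:, None]
    mu += rate * (x - mu)                                # eq. 2.11
    nu += rate * (x ** 2 - nu)                           # eq. 2.12
    # For a newly committed category, gamma**2 is first added to its row of nu,
    # so that eq. 2.13 yields sigma = gamma there.
    sigma[:] = np.sqrt(np.maximum(nu - mu ** 2, 0.0))    # eq. 2.13
    return mu, sigma, nu, n
```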
3 Expectation-Maximization

Now we show the relationship between GAM and the EM approach to mixture modeling. EM is a general iterative optimization technique for obtaining maximum likelihood estimates of observed data that are in some way incomplete (Dempster et al., 1977). Each iteration of EM consists of an expectation step (E-step) followed by a maximization step (M-step). We start with an "incomplete-data" likelihood function of the model given the data and then posit a "complete-data" likelihood function, which is much easier to maximize but depends on unknown, missing data. The E-step finds the expectation of the complete-data likelihood function, yielding a deterministic function. The M-step then updates the system parameters to maximize this function. Dempster et al. (1977) proved that each iteration of EM yields an increase in the incomplete-data likelihood until a local maximum is reached.

3.1 Gaussian Mixture Modeling. First, let us consider density estimation of the input space using (separable) gaussian mixtures. We model the training set, X = {x_t}_{t=1}^{T}, as comprising independent, identically distributed samples generated from a mixture density, which is parameterized by Θ = {α_j, θ_j}_{j=1}^{N}. The incomplete-data density of X given Θ is

   P(X | Θ) = Π_{t=1}^{T} P(x_t | Θ) = Π_{t=1}^{T} Σ_{j=1}^{N} α_j P(x_t | θ_j),   (3.1)

where θ_j parameterizes the distribution of the jth component, and α_j represents its a priori probability, or mixture proportion: α_j ≥ 0 and Σ_{j=1}^{N} α_j = 1. The incomplete-data log likelihood of Θ given X is

   l(Θ | X) = Σ_{t=1}^{T} log Σ_{j=1}^{N} α_j P(x_t | θ_j),   (3.2)
which is difficult to maximize because it includes the log of a sum. Intuitively, equation 3.2 contains a credit assignment problem, because it is not clear which component generated each data sample. To get around this problem, we introduce "missing data" in the form of a set of indicator variables, Z = {z_t}_{t=1}^{T}, such that z_tj = 1 if component j generated sample t and z_tj = 0 otherwise. Now, using the complete data, {X, Z}, we can explicitly assign credit and thus decouple the overall maximization problem into a set of simple maximizations by defining a complete-data density function,

   P(X, Z | Θ) = Π_{t=1}^{T} Π_{j=1}^{N} [ α_j P(x_t | θ_j) ]^{z_tj},   (3.3)

from which we obtain a complete-data log likelihood,

   l_c(Θ | X, Z) = Σ_{t=1}^{T} Σ_{j=1}^{N} z_tj log[ α_j P(x_t | θ_j) ],   (3.4)

which does not include a log of a sum. However, note that l_c(Θ | X, Z) is a random variable because the missing variables Z are unknown. Therefore, the EM algorithm finds the expected value of l_c(Θ | X, Z) in the E-step:

   Q(Θ | Θ^{(p)}) = E[ l_c(Θ | X, Z) | X, Θ^{(p)} ],   (3.5)
where Θ^{(p)} is the set of parameters at the pth iteration. The E-step yields a deterministic function Q(Θ | Θ^{(p)}), and the M-step then maximizes this function with respect to Θ to obtain new parameters, Θ^{(p+1)}:

   Θ^{(p+1)} = arg max_Θ Q(Θ | Θ^{(p)}).   (3.6)

Dempster et al. (1977) proved that each iteration of EM yields an increase in the incomplete-data likelihood until a local maximum is reached:

   l(Θ^{(p+1)} | X) ≥ l(Θ^{(p)} | X).   (3.7)

The E-step in equation 3.5 simplifies to computing (for all t, j) y_tj^{(p)} = E[z_tj | x_t, Θ^{(p)}], the probability that component j generated sample t. For comparison with GAM, we define the density P(x_t | θ_j) as a separable gaussian distribution, yielding

   y_tj^{(p)} = [ α_j^{(p)} (Π_{i=1}^{M} σ_ji^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_ji^{(p)})/σ_ji^{(p)})² ) ] / [ Σ_{l=1}^{N} α_l^{(p)} (Π_{i=1}^{M} σ_li^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_li^{(p)})/σ_li^{(p)})² ) ].   (3.8)
For distributions in the exponential family, the M-step simply updates the model parameters based on their reestimated sufficient statistics, which are computed in a batch procedure that weights each sample by its probability, y_tj^{(p)}:

   α_j^{(p+1)} = (1/T) Σ_{t=1}^{T} y_tj^{(p)},   (3.9)

   μ_ji^{(p+1)} = Σ_{t=1}^{T} y_tj^{(p)} x_ti / Σ_{t=1}^{T} y_tj^{(p)},   (3.10)

   σ_ji^{(p+1)} = √( Σ_{t=1}^{T} y_tj^{(p)} x_ti² / Σ_{t=1}^{T} y_tj^{(p)} − (μ_ji^{(p+1)})² ).   (3.11)
Note that y_tj^{(p)} in equation 3.8 is equivalent to GAM's category activation term y_j in equation 2.3, provided that vigilance is zero (ρ = 0). Also, the EM parameter reestimation equations 3.9 through 3.11 are essentially the same as GAM's learning equations 2.10 through 2.13, except that EM uses batch learning with a constant number of components, while GAM uses incremental learning, updating the parameters after each input sample, and recruiting new categories as needed.
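For comparison, one batch EM iteration (equations 3.8 through 3.11) in NumPy; the log-domain E-step is a standard numerical-stability choice of ours, not part of the text.

```python
import numpy as np

def em_step(X, alpha, mu, sigma):
    """One EM iteration for a separable gaussian mixture; eqs. 3.8-3.11."""
    T, M = X.shape
    # E-step (eq. 3.8): responsibilities y[t, j], computed in the log domain.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    log_g = (np.log(alpha) - np.log(sigma).sum(axis=1))[None, :] \
            - 0.5 * np.sum(diff ** 2, axis=2)
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    y = g / g.sum(axis=1, keepdims=True)
    # M-step (eqs. 3.9-3.11): reestimate from responsibility-weighted statistics.
    Nj = y.sum(axis=0)
    alpha = Nj / T
    mu = (y.T @ X) / Nj[:, None]
    second = (y.T @ X ** 2) / Nj[:, None]
    sigma = np.sqrt(np.maximum(second - mu ** 2, 1e-12))
    return alpha, mu, sigma, y
```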
3.2 Extension to Classification. The EM mixture modeling algorithm is extended to classification problems by modeling the class label as a multinomial variable (Ghahramani & Jordan, 1994). Therefore, each mixture component consists of a gaussian distribution for the "input" features and a multinomial distribution for the "output" class labels. Thus, the classification problem is cast as a density estimation problem, in which the mixture components represent the joint density of the input-output mapping. Specifically, the joint probability that the tth sample has input features x_t and output class k(t) is denoted by

   P(x_t, K = k(t) | Θ) = Σ_{j=1}^{N} α_j P(x_t, K = k(t) | θ_j, λ_j) = Σ_{j=1}^{N} λ_{jk(t)} α_j P(x_t | θ_j),   (3.12)
where the multinomial distribution is parameterized by λ_jk = P(K = k | j; θ_j) and Σ_k λ_jk = 1. This classification algorithm is trained the same way as the gaussian mixture algorithm, except that equation 3.8 becomes

   y_tj^{(p)} = [ λ_{jk(t)}^{(p)} α_j^{(p)} (Π_{i=1}^{M} σ_ji^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_ji^{(p)})/σ_ji^{(p)})² ) ] / [ Σ_{l=1}^{N} λ_{lk(t)}^{(p)} α_l^{(p)} (Π_{i=1}^{M} σ_li^{(p)})^{−1} exp( −(1/2) Σ_{i=1}^{M} ((x_ti − μ_li^{(p)})/σ_li^{(p)})² ) ],   (3.13)
and the multinomial parameters are updated via

   λ_jk^{(p+1)} = Σ_{t=1}^{T} y_tj^{(p)} δ[k − k(t)] / Σ_{t=1}^{T} y_tj^{(p)},   (3.14)
where δ[ω] = 1 if ω = 0 and δ[ω] = 0 if ω ≠ 0. During testing, the class label is missing and its expected value is "filled in" to determine the system prediction:

   K = arg max_k Σ_{j=1}^{N} y_tj λ_jk,   (3.15)
where y_tj is computed via equation 3.8. The parameter λ_jk plays an analogous role to GAM's membership function, j ∈ E(k). GAM's equation 2.5 performs the same computation as equation 3.15, provided that λ_jk = 1 if j ∈ E(k) and λ_jk = 0 otherwise.
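A sketch of the supervised variant, equations 3.13 through 3.15; the one-hot encoding and the reuse of the log-domain trick from `em_step` are our own choices.

```python
import numpy as np

def em_classify_step(X, labels, alpha, mu, sigma, lam):
    """Supervised E-step (eq. 3.13) and multinomial update (eq. 3.14).

    labels : (T,) integer class k(t) of each sample; lam : (N, K) multinomials.
    """
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]
    log_g = (np.log(alpha) - np.log(sigma).sum(axis=1))[None, :] \
            - 0.5 * np.sum(diff ** 2, axis=2)
    log_g += np.log(lam[:, labels].T + 1e-300)   # weight component j by lambda_{j,k(t)}
    g = np.exp(log_g - log_g.max(axis=1, keepdims=True))
    y = g / g.sum(axis=1, keepdims=True)         # eq. 3.13
    onehot = np.eye(lam.shape[1])[labels]        # delta[k - k(t)] as a (T, K) array
    lam = (y.T @ onehot) / y.sum(axis=0)[:, None]  # eq. 3.14
    return y, lam

def em_predict(y_unsup, lam):
    """Eq. 3.15: K = argmax_k sum_j y_tj lambda_jk, with y_tj from eq. 3.8."""
    return np.argmax(y_unsup @ lam, axis=1)
```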
Note that if each EM component is initialized so that λ_jk = 1 for some k, then λ_jk will never change according to update equation 3.14. Therefore, with this restriction, along with the restriction that vigilance is always zero (ρ ≡ 0), EM becomes a batch-learning version of GAM. Thus, GAM and EM use similar learning equations and obtain a similar final representation: a gaussian mixture model, with mappings from the mixture components to class labels.

However, the learning dynamics of the two algorithms are quite different due to GAM's match tracking operation. The EM algorithm is a variable metric gradient ascent algorithm, in which each step in parameter space is related to the gradient of the log likelihood of the mixture model (Xu & Jordan, 1996). With each step, the likelihood of the I/O density estimate increases until a local maximum is reached. In other words, the parameterization at each step is represented by a point in the parameter space, which has a constant dimensionality. The system is initialized at some point in this parameter space, and the point moves with each training epoch, based on the gradient of the log likelihood, until a local maximum of the likelihood is reached.

GAM's parameters are updated using an incremental approximation to the batch-learning EM algorithm. However, the most important respect in which GAM's learning procedure differs from that of EM is that the former uses predictive feedback via the match tracking process. When errors occur in the I → O mapping, match tracking reduces, by varying amounts, the number of categories that learn and thus restricts the movement of GAM's parameterization to a parameter subspace. Match tracking also causes uncommitted categories to be chosen, which expands the dimensionality of the parameter space. Newly committed categories have small a priori probabilities and large standard deviations, and thus a weak but ubiquitous influence on the gradient.

3.3 Incremental Variants of EM. One of the practical advantages of GAM over the standard EM algorithm described in sections 3.1 and 3.2 is that the former learns incrementally whereas the latter learns using a batch method. However, incremental variants of EM have also been developed. Most notably, Neal and Hinton (1993) showed that EM can incrementally update the model parameters if a separate set of sufficient statistics is stored for each input sample. That is, a separate set of the statistics computed in equations 3.9 through 3.11 and 3.14, corresponding to each input sample, can be saved. In this way, the E-step and M-step can be computed following the presentation of each sample, with the maximization affecting only the statistics associated with the current sample. Because this incremental EM recomputes expectations and maximizations following each input sample, it incorporates new information immediately into its parameter updates and thereby converges more quickly than the standard EM batch algorithm.

Incremental EM illustrates the statistics that need to be maintained in order to ensure monotonic convergence of the model likelihood in an online
setting. However, the need to store separate statistics for each input sample makes incremental EM extremely nonlocal and, moreover, quite impractical for use on large data sets. On the other hand, there also exist incremental approximations of EM that use local learning but do not guarantee monotonic convergence of the model likelihood. For example, Hinton and Nowlan (1990) used an incremental equation for updating variance estimates that is identical to equation 2.14, except that a constant learning rate coefficient was used rather than the decaying term, n_j^{−1}.

GAM differs from standard EM due to both GAM's match tracking procedure and its incremental approximation to EM's learning equations. To make comparisons of GAM and EM on real-world classification problems more informative, the effects of each of these differences should be isolated. Therefore, in the following section GAM is compared to an incremental EM approximation as well as to the standard EM algorithm. Furthermore, in order to isolate the role played by match tracking, we use GAM's learning equations as the incremental EM approximation.

4 Simulations: Comparisons of GAM and EM

4.1 Methodology. All three classification tasks are evaluated using the same procedure. The data sets are normalized to have unit variance in each dimension. EM is tested with several different N. For each setting of N, EM is trained five times with different initializations, and the five test results are averaged. EM's performance often peaks quickly and then declines due to overfitting, particularly when N is large. Therefore, EM's performance is plotted following two training epochs, when its best performance is generally obtained (for large N), and also following equilibration.

GAM uses ρ̄ ≈ 0 (precisely, ρ̄ = 10^{−7M}) for all simulations, and γ is varied. For each setting of γ, GAM is trained five times with different random orderings of the data, and the data order is also scrambled between each training epoch. GAM is trained for 100 epochs, and the test results are plotted after each epoch. As the results below illustrate, GAM often begins with relatively poor performance for the first few training epochs (particularly when γ is large, because it biases the initial category standard deviations), after which performance improves. Performance sometimes peaks and then declines due to overfitting. To convey a full picture of GAM's behavior, GAM's average performance is plotted for each of its 100 training epochs and for each setting of γ.

Several initialization procedures for EM were evaluated and found to produce widely different results. The most successful of these is reported here. As it happens, this procedure initializes EM components in essentially the same way that GAM categories are initialized, except that the former are all initialized prior to training. Specifically, each mixture component is assigned to one of N randomly selected samples, denoted by {x_t, k(t)}_{t=1}^{N},
It is fortuitous that EM’s best initialization procedure corresponds so closely to that of GAM. This makes the comparison between the two algorithms more revealing because it isolates the role played by match tracking. Apart from their batch versus incremental-learning distinction, this EM algorithm operates identically to a GAM network that has a constant ρ ≡ 0. This is because EM equation 3.13, in which “feedback” from the class label directly “activates” the mixture components, is functionally equivalent to the GAM process of resetting all ensemble categories that make a wrong prediction and finally basing learning on the chosen-ensemble activations in equation 2.6 that correspond to the correct prediction. Thus, GAM is set apart only by match tracking, which raises ρ according to equation 2.7 when an incorrect prediction is made.

The contribution of match tracking can be further isolated by removing the batch versus incremental-learning distinction. This is done by using a static-GAM (S-GAM) network, which is identical to GAM except that it has a fixed vigilance (ρ ≡ ρ̄), which prevents S-GAM from committing new categories during training because the baseline vigilance, ρ̄ = 10^{-7M}, is too small to reset any of the committed categories. Therefore, S-GAM needs to be initialized with a set of N categories prior to training. S-GAM is initialized the same way as EM (described above), with each of the N categories assigned to one of N randomly selected training samples and with an ensemble that maps to each of the output classes. If S-GAM makes an incorrect prediction during training, the chosen ensemble is reset, but vigilance is not raised. Because vigilance is never raised, match tracking has no effect on learning other than to ensure that the correct prediction is made before learning occurs. Therefore, S-GAM’s procedure is functionally identical to the EM procedure of directly activating categories based on both the input and the supervised class label.

By including comparisons to S-GAM, therefore, the effects of match tracking alone (GAM versus S-GAM), the effects of incremental learning alone (S-GAM versus EM), and the effects of match tracking and incremental learning together (GAM versus EM) are revealed.
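The functional relationships just described can be summarized in a short control-flow sketch of one training presentation. Equations 2.6 and 2.7 appear earlier in the paper and are not reproduced here, so every helper on the hypothetical net object (match, choose_ensemble, predict, learn, commit_new_category) is a placeholder for the corresponding GAM operation; only the reset-and-raise loop that distinguishes GAM from S-GAM is the point.

    def present(net, x, label, rho_bar, match_tracking=True, eps=1e-9):
        """One training presentation.  GAM raises vigilance after a wrong
        prediction (match tracking); S-GAM keeps rho fixed at rho_bar and
        only resets the offending ensemble.  All net.* methods are
        hypothetical placeholders for the operations defined earlier."""
        rho = rho_bar
        excluded = set()                       # categories reset on this trial
        while True:
            active = [j for j in net.categories
                      if j not in excluded and net.match(j, x) >= rho]
            if not active:                     # nothing satisfies the criterion:
                return net.commit_new_category(x, label)
            ensemble = net.choose_ensemble(active, x)
            if net.predict(ensemble) == label:
                return net.learn(ensemble, x)  # only surviving categories learn
            excluded.update(ensemble)          # wrong prediction: reset ensemble
            if match_tracking:                 # GAM only: raise the criterion
                rho = max(net.match(j, x) for j in ensemble) + eps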
Table 1: Letter Image Classification.

    Algorithm           Error Rate (%)
    k-NN                 4.4
    HAC (a)             19.2
    HAC (b)             18.4
    HAC (c)             17.3
    GAM-CL (γ = 2)       6.0
    GAM-CL (γ = 4)       6.3
    FAM (α = 1.0)        8.1
    FAM (α = 0.1)       13.5

Source: Results are adapted from Frey and Slate (1991) and Williamson (1996a). Notes: k-NN = nearest-neighbor classifier (k = 1). HAC = Holland-style adaptive classifier. Results of three HAC variations are shown here (see Frey & Slate, 1991). GAM-CL = Gaussian ARTMAP with choice learning. FAM = Fuzzy ARTMAP.

4.2 Letter Image Recognition. EM, GAM, and S-GAM are first evaluated on a letter image recognition task developed in Frey and Slate (1991). The data set, which is archived in the UCI machine learning repository (King, 1992), consists of 16-dimensional vectors derived from machine-generated images of alphabetical characters (A to Z). The classification problem is to predict the correct letter from the 16 features. Classification difficulty stems from the fact that the characters are generated from 20 different fonts, are randomly warped, and only simple features such as the
total number of “on” pixels and the size and position of a box around the “on” pixels are used. The data set consists of 20,000 samples, the first 16,000 of which are used for training and the last 4000 for testing. For comparison, the results of several other classifiers are shown in Table 1 (see Frey & Slate, 1991; Williamson, 1996a).

Figure 1 shows the classification results of EM and GAM on the letter image recognition problem. EM’s average error rate is plotted as a function of N (N = 100, 200, . . . , 1000). For each N, the error rate is shown after two training epochs (solid line) and after EM has equilibrated (dashed line). Note that with N < 600, EM performs better following equilibration, but with N > 600, EM performs better following two epochs, and then performance declines, presumably due to overfitting. EM’s best performance (7.4 ± 0.2 percent error) is obtained with N = 1000 following two training epochs.

GAM’s average error rate is plotted for γ = 1, 2, 4. Each point along one of GAM’s error curves corresponds to a different training epoch, with the first epoch plotted at the left-most point of a curve. As training proceeds, the number of categories increases, and the error rate decreases. After a certain point, the error rate may increase again due to overfitting. As γ is raised, the error curves shift to the left. The initial performance becomes progressively worse, and it takes longer for GAM’s performance to peak. However, fewer categories are created, and there is less degradation due to
Figure 1: The average error rates of EM and GAM on the letter image classification problem plotted as a function of the number of categories, N. EM’s error rates are shown after two training epochs and after equilibration. EM is trained with N = 100, 200, . . . , 1000. GAM is trained with γ = 1, 2, 4. GAM’s error rates are plotted after each training epoch. Each of GAM’s error curves corresponds to a different value of γ, with the left-most point on a curve corresponding to the first epoch. From left to right along a curve, each successive point corresponds to a successive training epoch.
overfitting. GAM’s best performance (5.4 ± 0.3 percent error) is obtained with γ = 1 following six training epochs. For all settings of γ, GAM achieves lower error rates than EM for all N. As γ is raised, GAM requires more training epochs to surpass EM. However, EM does achieve reasonably low error rates with N much smaller than the number of categories created by GAM for any value of γ. Figure 1 also suggests a general pattern
Table 2: The Trade-Offs Entailed by the Choice of γ for GAM for Three Classification Problems.

    γ     Number of    Error       Number of     Storage
          Epochs       Rate (%)    Categories    Rate* (%)

    a. Letter image classification
    1       1           7.7        812.8         10.5
    1       6           5.4        941.2         12.1
    2      38           6.0        807.2         10.4
    4     100           6.6        712.4          9.2
    8     100           8.8        593.8          7.7

    b. Satellite image classification
    1       1          12.4        125.0          5.7
    1     100          10.0        255.2         11.7
    2     100          11.2        212.0          9.7
    4     100          12.2        149.2          6.8
    8     100          13.9         93.4          4.3

    c. Spoken vowel classification
    1       1          49.7         72.4         28.8
    2       5          48.7         62.2         24.7
    4       6          44.0         53.8         21.4
    8      19          43.9         46.6         18.5

Notes: The lowest error rates (averaged over five runs) obtained for each of four settings of γ (γ = 1, 2, 4, 8) are shown. The error rate obtained with γ = 1 after only one training epoch is also shown to illustrate GAM’s fast-learning capability.
*The amount of storage used by GAM divided by the amount used by the training set.
in GAM’s performance: in a plot of error rate as a function of number of categories, there exists (roughly) a U-shaped envelope. For different values of γ, GAM approaches and then follows a different portion of that envelope. As the remaining simulations illustrate, however, the relationship between γ and the placement of this envelope varies between different tasks.

To illustrate further the trade-offs entailed by the choice of γ, Table 2a lists the lowest error rate obtained on this problem for different values of γ, along with the number of training epochs used to reach that point and the number of categories created. Finally, the storage rate, which is the amount of storage used by GAM relative to that used by the training set (and hence, by a nearest-neighbor classifier), is listed. For an M-dimensional input, each GAM category stores 2M + 1 values, so the storage rate is calculated as N(2M + 1)/(TM), where T is the number of training samples.
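In code, the storage rate is a one-liner; the example values below are taken from Table 2a and reproduce its entry.

    def storage_rate(N, M, T):
        """GAM storage relative to the training set: N categories of
        2M + 1 values each, versus T training samples of M values each."""
        return N * (2 * M + 1) / (T * M)

    # Letter task, gamma = 1 at its best epoch (Table 2a):
    # storage_rate(941.2, 16, 16000) ~= 0.121, i.e., 12.1 percent.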
Figure 2: Average error rates of EM, GAM, and S-GAM plotted as a function of the number of training epochs.
In Figure 2, the error rates are plotted as a function of the number of training epochs. GAM’s results are shown for its best parameter setting (γ = 1), and the results of EM and S-GAM are shown for similar parameters (γ = 1, N = 1000). GAM quickly achieves its best performance at six epochs, after which performance slowly degrades due to overfitting. EM quickly achieves its best performance at two epochs, after which performance quickly degrades and then stabilizes. S-GAM’s performance improves more slowly than EM’s, but S-GAM eventually obtains lower error rates than EM. For other values of N, the relative performance of EM and S-GAM is similar to that shown in Figure 2. Therefore, on this problem S-GAM’s incremental approximation to EM’s batch learning seems to confer a small advantage. Much of this advantage is probably due to the fact that S-GAM’s standard deviations decrease less than EM’s, due to the lingering effect of γ. Additional simulations have shown that EM suffers less from
overfitting if the shrinkage of its standard deviations is attenuated during learning, although this variation does not appear to improve its best results.

Figures 1 and 2 show that match tracking causes GAM to construct an appropriate number of categories to support the I → O mapping and to learn the mapping more accurately than EM or S-GAM. However, these results do not indicate why this is the case. There are two possible explanations for why match tracking gives GAM an advantage: (1) match tracking causes GAM to obtain a better estimate of the I/O density (i.e., to obtain a mixture model with a higher likelihood), or (2) match tracking biases GAM’s density estimate so as to reduce its predictive error in the I → O mapping. The results shown in Figures 3 and 4 indicate that the latter explanation is correct.

Figure 3 shows the error rate over 25 training epochs for a single run of GAM, on which GAM created 985 categories: the error rate on the training set (top) and the test set (bottom). Figure 3 also shows the corresponding error rates for EM and S-GAM initialized with the same number (N = 985) of categories as were created by GAM. On both the training set and the test set, GAM obtains much lower error rates than EM and S-GAM. Next, Figure 4 shows the log likelihoods of the mixture models formed by EM, GAM, and S-GAM. EM obtains a much higher log likelihood than GAM and S-GAM on both the training set and the test set, whereas the log likelihoods obtained by GAM and S-GAM are similar. Therefore, GAM outperforms EM and S-GAM because match tracking biases its density estimate such that the predictive error in its I → O mapping is reduced, despite the fact that GAM obtains an estimate of the I/O density with a lower likelihood than that of EM.

4.3 Landsat Satellite Image Segmentation. EM, GAM, and S-GAM are evaluated on a real-world task: segmentation of a Landsat satellite image (Feng, Sutherland, King, Muggleton, & Henery, 1993). The data set, which is archived in the UCI machine learning repository (King, 1992), consists of multispectral values within nonoverlapping 3 × 3 pixel neighborhoods in an image obtained from a Landsat satellite multispectral scanner. At each pixel are four values, corresponding to four spectral bands. Two of these are in the visible region (corresponding approximately to the green and red regions of the visible spectrum), and two are in the (near) infrared. The input space is thus 36-dimensional (9 pixels and 4 values per pixel). The spatial resolution of a pixel is about 80 m × 80 m. The center pixel of each neighborhood is associated with one of six vegetation classes: red soil, cotton crop, gray soil, damp gray soil, soil with vegetation stubble, and very damp gray soil. The data set is partitioned into 4435 samples for training and 2000 samples for testing. For comparison, the results of several other classifiers are shown in Table 3 (see Feng et al., 1993; Asfour, Carpenter, & Grossberg, 1995).

Figure 5 shows the classification results of EM and GAM on the satellite image segmentation problem. EM’s error rate is plotted for N = 25, 50, . . . ,
Figure 3: Error rates of EM, GAM, and S-GAM on the training set (top) and the test set (bottom) plotted as a function of the number of training epochs.
Figure 4: Log likelihoods of EM, GAM, and S-GAM on the training set (top) and test set (bottom) plotted as a function of the number of training epochs.
Table 3: Satellite Image Classification.

    Algorithm    Error Rate (%)
    k-NN         10.6
    FAM          11.0
    RBF          12.1
    Alloc80      13.2
    INDCART      13.7
    CART         13.8
    MLP          13.9
    NewID        15.0
    C4.5         15.1
    CN2          15.2
    Quadra       15.3
    SMART        15.9
    LogReg       16.9
    Discrim      17.1
    CASTLE       19.4

Source: Results adapted from Feng et al. (1993) and Asfour et al. (1995). Note: The k-NN result reported here, which we obtained, is different from the k-NN result reported in Feng et al. (1993).
275, following two epochs and following equilibration. For N < 100, EM performs better after equilibration, whereas for N > 100, EM performs better after two epochs. EM’s best results (10.6 ± 0.4 percent error) are obtained with N = 275. GAM’s error rate is plotted for γ = 1, 2, 4. Once again, GAM achieves the lowest overall error rate (10.0 ± 0.4 percent error), although on this problem GAM outperforms EM only with γ = 1. It is difficult to distinguish GAM’s error curves because they overlap each other so much. Unlike the letter image recognition results, none of GAM’s error curves turns up due to overtraining. Therefore, all the curves appear to be on the left side of our hypothetical U-shaped envelope. Table 2b reports the lowest error rate for different values of γ, along with the relevant statistics: number of training epochs, number of categories, and storage rate.

Figure 6 plots the error rates as a function of the number of training epochs for GAM (γ = 1), EM (γ = 1, N = 250), and S-GAM (γ = 1, N = 250). GAM’s performance generally improves throughout, whereas EM’s performance again peaks at two epochs and then quickly degrades and stabilizes. On this problem EM outperforms S-GAM.
Figure 5: Average error rates of EM and GAM plotted as a function of the number of categories.
4.4 Speaker-Independent Vowel Recognition. Finally, EM, GAM, and S-GAM are evaluated on another real-world task: speaker-independent vowel recognition (Deterding, 1989). The data set is archived in the CMU connectionist benchmark collection (Fahlman, 1993). The data were collected by Deterding (1989), who recorded examples of the 11 steady-state vowels of English spoken by 15 speakers. A word containing each vowel was spoken once by each of the 15 speakers (7 females and 8 males). The speech signals were low-pass filtered at 4.7 kHz and then digitized to 12 bits at a 10-kHz sampling rate. Twelfth-order linear predictive analysis was carried out on six 512-sample Hamming-windowed segments from the steady part of the vowel. The reflection coefficients were used to calculate 10 log area parameters, giving a 10-dimensional input space. Each speaker thus
Figure 6: Average error rates of EM, GAM, and S-GAM plotted as a function of the number of training epochs.
yielded six samples of speech from the 11 vowels, resulting in 990 samples from the 15 speakers. The data are partitioned into 528 samples for training, from four male and four female speakers, and 462 samples for testing, from the remaining four male and three female speakers. For comparison, the results of several other classifiers are shown in Table 4 (see Robinson, 1989; Fritzke, 1994; Williamson, 1996a).

Figure 7 shows the classification results of EM and GAM on the vowel recognition problem. EM’s error rate is plotted for N = 15, 20, . . . , 70, following two epochs and following equilibration. For all N, EM obtains better results after two epochs than after equilibration. EM’s best results (45.4 ± 1.1 percent error) are obtained with N = 40. GAM’s error rate is plotted for γ = 2, 4, 8. By a small margin, GAM achieves the lowest overall
Table 4: Spoken Vowel Classification.

    Algorithm          Error Rate (%)
    k-NN               43.7
    MLP                49.4
    MKM                57.4
    RBF                52.4
    GNN                46.5
    SNN                45.2
    3-D GCS            36.8
    5-D GCS            33.5
    GAM-CL (γ = 2)     43.3
    GAM-CL (γ = 4)     41.8
    FAM (α = 1.0)      48.9
    FAM (α = 0.1)      50.4

Source: Results adapted from Robinson (1989), Fritzke (1994), and Williamson (1996a). Notes: Gradient descent networks (using 88 internal nodes) are: MLP (multilayer perceptron), MKM (modified Kanerva model), RBF (radial basis function), GNN (gaussian node network), and SNN (square node network). Constructive networks are: GCS (growing cell structures), GAM-CL, and FAM.
error rate (43.9 ± 1.4 percent error), but it outperforms EM only with γ = 4 and γ = 8. The data set is quite small, and hence both algorithms have a strong tendency to overfit the data. Table 2c reports the lowest error rate for different values of γ, along with the relevant statistics: the number of training epochs, number of categories, and storage rate.

Figure 8 plots the error rates as a function of the number of training epochs for GAM (γ = 3), EM (γ = 3, N = 40), and S-GAM (γ = 3, N = 40). GAM’s performance peaks at 19 epochs and then slowly degrades. EM’s performance peaks at 2 epochs and then degrades quickly and severely before stabilizing. S-GAM’s error rate, which stabilizes near 50 percent, never reaches EM’s best performance.

5 Conclusions

GAM learns mappings from a real-valued space of input features to a discrete-valued space of output classes by learning a gaussian mixture model of the input space as well as connections from the mixture components to the output classes. The mixture components correspond to nodes in GAM’s internal category layer. GAM is a simple neural architecture that employs constructive, incremental, local learning rules.
Figure 7: Average error rates of EM and GAM plotted as a function of the number of categories.
These learning rules allow GAM to create a representation of appropriate size as it is trained online. We have shown a close relationship between GAM and an EM algorithm that estimates the joint I/O density by optimizing a gaussian-multinomial mixture model. The major difference between GAM and EM is GAM’s match tracking procedure, which raises a match criterion following incorrect predictions and prevents nodes from learning if they do not satisfy the raised match criterion. Match tracking biases GAM’s estimate of the joint I/O density such that GAM’s predictive error in the I → O map is reduced. With this biased density estimate, GAM outperforms EM on classification benchmarks
Figure 8: Average error rates of EM, GAM, and S-GAM plotted as a function of the number of training epochs.
despite the fact that GAM uses a suboptimal incremental approximation to EM’s batch learning rules and learns mixture models that have lower likelihoods than those optimized by EM.

Acknowledgments

This work was supported in part by the Office of Naval Research (ONR N00014-95-1-0409).

References

Asfour, Y., Carpenter, G. A., & Grossberg, S. (1995). Landsat satellite image segmentation using the fuzzy ARTMAP neural network (Tech. Rep. No. CAS/CNS
TR-95-004). Boston: Boston University.

Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115.

Carpenter, G. A., Grossberg, S., & Reynolds, J. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565–588.

Carpenter, G. A., Grossberg, S., & Rosen, D. B. (1991). Fuzzy ART: Fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 4, 759–771.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Deterding, D. H. (1989). Speaker normalisation for automatic speech recognition. Unpublished doctoral dissertation, University of Cambridge.

Fahlman, S. E. (1993). CMU benchmark collection for neural net learning algorithms [Machine-readable data repository]. Pittsburgh: Carnegie Mellon University, School of Computer Science.

Feng, C., Sutherland, A., King, S., Muggleton, S., & Henery, R. (1993). Symbolic classifiers: Conditions to have good accuracy performance. In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics (pp. 371–380). Waterloo, Ontario: University of Waterloo.

Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6, 161–182.

Fritzke, B. (1994). Growing cell structures—A self-organizing network for unsupervised and supervised learning. Neural Networks, 7, 1441–1460.

Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems, 6. San Mateo, CA: Morgan Kaufmann.

Grossberg, S. (1976). Adaptive pattern classification and universal recoding. I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.

Grossberg, S., & Williamson, J. (1997). A self-organizing neural system for learning to recognize textured scenes (Tech. Rep. No. CAS/CNS TR-97-001). Boston, MA: Boston University.

Hinton, G. E., & Nowlan, S. J. (1990). The bootstrap Widrow-Hoff rule as a cluster-formation algorithm. Neural Computation, 2, 355–362.

King, R. (1992). Statlog databases. UCI Repository of machine learning databases [Machine-readable repository at ics.uci.edu:/pub/machine-learning-databases].

Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. Unpublished manuscript.

Poggio, T., & Girosi, F. (1989). A theory of networks for approximation and learning (A.I. Memo No. 1140). Cambridge, MA: MIT.

Robinson, A. J. (1989). Dynamic error propagation networks. Unpublished doctoral dissertation, Cambridge University.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Parallel Distributed Processing (pp. 318–362). Cambridge, MA: MIT Press.

Williamson, J. R. (1996a). Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps. Neural Networks, 9, 881–897.

Williamson, J. R. (1996b). Neural networks for image processing, classification, and understanding. Unpublished doctoral dissertation, Boston University.

Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129–151.
Received July 9, 1996; accepted January 29, 1997.
Communicated by Shimon Ullman
Shape Quantization and Recognition with Randomized Trees

Yali Amit
Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.
Donald Geman
Department of Mathematics and Statistics, University of Massachusetts, Amherst, MA 01003, U.S.A.
We explore a new approach to shape recognition based on a virtually infinite family of binary features (queries) of the image data, designed to accommodate prior information about shape invariance and regularity. Each query corresponds to a spatial arrangement of several local topographic codes (or tags), which are in themselves too primitive and common to be informative about shape. All the discriminating power derives from relative angles and distances among the tags. The important attributes of the queries are a natural partial ordering corresponding to increasing structure and complexity; semi-invariance, meaning that most shapes of a given class will answer the same way to two queries that are successive in the ordering; and stability, since the queries are not based on distinguished points and substructures. No classifier based on the full feature set can be evaluated, and it is impossible to determine a priori which arrangements are informative. Our approach is to select informative features and build tree classifiers at the same time by inductive learning. In effect, each tree provides an approximation to the full posterior where the features chosen depend on the branch that is traversed. Due to the number and nature of the queries, standard decision tree construction based on a fixed-length feature vector is not feasible. Instead we entertain only a small random sample of queries at each node, constrain their complexity to increase with tree depth, and grow multiple trees. The terminal nodes are labeled by estimates of the corresponding posterior distribution over shape classes. An image is classified by sending it down every tree and aggregating the resulting distributions. The method is applied to classifying handwritten digits and synthetic linear and nonlinear deformations of three hundred LATEX symbols. State-of-the-art error rates are achieved on the National Institute of Standards and Technology database of digits. The principal goal of the experiments on LATEX symbols is to analyze invariance, generalization error, and related issues, and a comparison with artificial neural network methods is presented in this context.

Neural Computation 9, 1545–1588 (1997)
© 1997 Massachusetts Institute of Technology
Figure 1: LATEX symbols.
1 Introduction

We explore a new approach to shape recognition based on the joint induction of shape features and tree classifiers. The data are binary images of two-dimensional shapes of varying sizes. The number of shape classes may reach into the hundreds (see Figure 1), and there may be considerable within-class variation, as with handwritten digits. The fundamental problem is how to design a practical classification algorithm that incorporates the prior knowledge that the shape classes remain invariant under certain transformations. The proposed framework is analyzed within the context of invariance, generalization error, and other methods based on inductive learning, principally artificial neural networks (ANN).

Classification is based on a large, in fact virtually infinite, family of binary
features of the image data that are constructed from local topographic codes (“tags”). A large sample of small subimages of fixed size is recursively partitioned based on individual pixel values. The tags are simply labels for the cells of each successive partition, and each pixel in the image is assigned all the labels of the subimage centered there. As a result, the tags do not involve detecting distinguished points along curves, special topological structures, or any other complex attributes whose very definition can be problematic due to locally ambiguous data. In fact, the tags are too primitive and numerous to classify the shapes.

Although the mere existence of a tag conveys very little information, one can begin discriminating among shape classes by investigating just a few spatial relationships among the tags, for example, asking whether there is a tag of one type “north” of a tag of another type. Relationships are specified by coarse constraints on the angles of the vectors connecting pairs of tags and on the relative distances among triples of tags. No absolute location or scale constraints are involved. An image may contain one or more instances of an arrangement, with significant variations in location, distances, angles, and so forth. There is one binary feature (“query”) for each such spatial arrangement; the response is positive if a collection of tags consistent with the associated constraints is present anywhere in the image. Hence a query involves an extensive disjunction (ORing) operation.

Two images that answer the same to every query must have very similar shapes. In fact, it is reasonable to assume that the shape class is determined by the full feature set; that is, the theoretical Bayes error rate is zero. But no classifier based on the full feature set can be evaluated, and it is impossible to determine a priori which arrangements are informative. Our approach is to select informative features and build tree classifiers (Breiman, Friedman, Olshen, & Stone, 1984; Casey & Nagy, 1984; Quinlan, 1986) at the same time by inductive learning. In effect, each tree provides an approximation to the full posterior where the features chosen depend on the branch that is traversed.

There is a natural partial ordering on the queries that results from regarding each tag arrangement as a labeled graph, with vertex labels corresponding to the tag types and edge labels to angle and distance constraints (see Figures 6 and 7). In this way, the features are ordered according to increasing structure and complexity. A related attribute is semi-invariance, which means that a large fraction of those images of a given class that answer the same way to a given query will also answer the same way to any query immediately succeeding it in the ordering. This leads to nearly invariant classification with respect to many of the transformations that preserve shape, such as scaling, translation, skew, and small, nonlinear deformations of the type shown in Figure 2.

Due to the partial ordering, tree construction with an infinite-dimensional feature set is computationally efficient. During training, multiple trees (Breiman, 1994; Dietterich & Bakiri, 1995; Shlien, 1990) are grown, and a
Figure 2: (Top) Perturbed LATEX symbols. (Bottom) Training data for one symbol.
form of randomization is used to reduce the statistical dependence from tree to tree; weak dependence is verified experimentally. Simple queries are used at the top of the trees, and the complexity of the queries increases with tree depth. In this way semi-invariance is exploited, and the space of shapes is systematically explored by calculating only a tiny fraction of the answers.

Each tree is regarded as a random variable on image space whose values are the terminal nodes. In order to recognize shapes, each terminal node of each tree is labeled by an estimate of the conditional distribution over the shape classes given that an image reaches that terminal node. The estimates are simply relative frequencies based on training data and require no optimization. A new data point is classified by dropping it down each of the trees, averaging over the resulting terminal distributions, and taking the mode of this aggregate distribution. Due to averaging and weak dependence, considerable errors in these estimates can be tolerated. Moreover, since tree growing (i.e., question selection) and parameter estimation can be separated, the estimates can be refined indefinitely without reconstructing
the trees, simply by updating a counter in each tree for each new data point. The separation between tree making and parameter estimation, and the possibility of using different training samples for each phase, opens the way to selecting the queries based on either unlabeled samples (i.e., unsupervised learning) or samples from only some of the shape classes. Both of these perform surprisingly well compared with ordinary supervised learning.

Our recognition strategy differs from those based on true invariants (algebraic, differential, etc.) or structural features (holes, endings, etc.). These methods certainly introduce prior knowledge about shape and structure, and we share that emphasis. However, invariant features usually require image normalization or boundary extraction, or both, and are generally sensitive to shape distortion and image degradation. Similarly, structural features can be difficult to express as well-defined functions of the image (as opposed to model) data. In contrast, our queries are stable and primitive, precisely because they are not truly invariant and are not based on distinguished points or substructures.

A popular approach to multiclass learning problems in pattern recognition is based on ANNs, such as feedforward, multilayer perceptrons (Dietterich & Bakiri, 1995; Fukushima & Miyake, 1982; Knerr, Personnaz, & Dreyfus, 1992; Martin & Pitman, 1991). For example, the best rates on handwritten digits are reported in LeCun et al. (1990). Classification trees and neural networks certainly have aspects in common; for example, both rely on training data, are fast online, and require little storage (see Brown, Corruble, & Pittard, 1993; Gelfand & Delp, 1991). However, our approach to invariance and generalization is, by comparison, more direct in that certain properties are acquired by hardwiring rather than depending on learning or image normalization. With ANNs, the emphasis is on parallel and local processing and a limited degree of disjunction, in large part due to assumptions regarding the operation of the visual system. However, only a limited degree of invariance can be achieved with such models. In contrast, the features here involve extensive disjunction and more global processing, thus achieving a greater degree of invariance. This comparison is pursued in section 12.

The article is organized as follows. Other approaches to invariant shape recognition are reviewed in section 2; synthesized random deformations of 293 basic LATEX symbols (see Figures 1 and 2) provide a controlled experimental setting for an empirical analysis of invariance in a high-dimensional shape space. The basic building blocks of the algorithm, namely the tags and the tag arrangements, are described in section 3. In section 4 we address the fundamental question of how to exploit the discriminating power of the feature set; we attempt to motivate the use of multiple decision trees in the context of the ideal Bayes classifier and the trade-off between approximation error and estimation error. In section 5 we explain the roles
of the partial ordering and randomization for both supervised and unsupervised tree construction; we also discuss and quantify semi-invariance. Multiple decision trees and the full classification algorithm are presented in section 6, together with an analysis of the dependence on the training set. In section 7 we calculate some rough performance bounds, for both individual and multiple trees. Generalization experiments, where the training and test samples represent different populations, are presented in section 8, and incremental learning is addressed in section 9. Fast indexing, another possible role for shape quantization, is considered in section 10. We then apply the method in section 11 to a real problem—classifying handwritten digits—using the National Institute of Standards and Technology (NIST) database for training and testing, achieving state-of-the-art error rates. In section 12 we develop the comparison with ANNs in terms of invariance, generalization error, and connections to observed functions in the visual system. We conclude in section 13 by assessing extensions to other visual recognition problems.
2 Invariant Recognition

Invariance is perhaps the fundamental issue in shape recognition, at least for isolated shapes. Some basic approaches are reviewed within the following framework. Let X denote a space of digital images, and let C denote a set of shape classes. Let us assume that each image x ∈ X has a true class label Y(x) ∈ C = {1, 2, . . . , K}. Of course, we cannot directly observe Y. In addition, there is a probability distribution P on X. Our goal is to construct a classifier Ŷ: X → C such that P(Ŷ ≠ Y) is small.

In the literature on statistical pattern recognition, it is common to address some variation by preprocessing or normalization. Given x, and before estimating the shape class, one estimates a transformation ψ such that ψ(x) represents a standardized image. Finding ψ involves a sequence of procedures that brings all images to the same size and then corrects for translation, slant, and rotation by one of a variety of methods. There may also be some morphological operations to standardize stroke thickness (Bottou et al., 1994; Hastie, Buja, & Tibshirani, 1995). The resulting image is then classified by one of the standard procedures (discriminant analysis, multilayer neural network, nearest neighbors, etc.), in some cases essentially ignoring the global spatial properties of shape classes. Difficulties in generalization are often encountered because the normalization is not robust or does not accommodate nonlinear deformations. This deficiency can be ameliorated only with very large training sets (see the discussions in Hussain & Kabuka, 1994; Raudys & Jain, 1991; Simard, LeCun, & Denker, 1994; Werbos, 1991, in the context of neural networks). Still, it is clear that robust normalization
methods which reduce variability and yet preserve information can lead to improved performance of any classifier; we shall see an example of this in regard to slant correction for handwritten digits.

Template matching is another approach. One estimates a transformation from x for each of the prototypes in the library. Classification is then based on the collection of estimated transformations. This requires explicit modeling of the prototypes and extensive computation at the estimation stage (usually involving relaxation methods) and appears impractical with large numbers of shape classes.

A third approach, closer in spirit to ours, is to search for invariant functions Φ(x), meaning that P(Φ(x) = φc | Y = c) = 1 for some constants φc, c = 1, . . . , K. The discriminating power of Φ depends on the extent to which the values φc are distinct. Many invariants for planar objects (based on single views) and nonplanar objects (based on multiple views) have been discovered and proposed for recognition (see Reiss, 1993, and the references therein). Some invariants are based on Fourier descriptors and image moments; for example, the magnitude of Zernike moments (Khotanzad & Lu, 1991) is invariant to rotation. Most invariants require computing tangents from estimates of the shape boundaries (Forsyth et al., 1991; Sabourin & Mitiche, 1992). Examples of such invariants include inflexions and discontinuities in curvature. In general, the mathematical level of this work is advanced, borrowing ideas from projective, algebraic, and differential geometry (Mundy & Zisserman, 1992). Other successful treatments of invariance include geometric hashing (Lamdan, Schwartz, & Wolfson, 1988) and nearest-neighbor classifiers based on affine invariant metrics (Simard et al., 1994). Similarly, structural features involving topological shape attributes (such as junctions, endings, and loops) or distinguished boundary points (such as points of high curvature) have some invariance properties, and many authors (e.g., Lee, Srihari, & Gaborski, 1991) report much better results with such features than with standardized raw data.

In our view, true invariant features of the form above might not be sufficiently stable for intensity-based recognition because the data structures are often too crude to analyze with continuum-based methods. In particular, such features are not invariant to nonlinear deformations and depend heavily on preprocessing steps such as normalization and boundary extraction. Unless the data are of very high quality, these steps may result in a lack of robustness to distortions of the shapes, due, for example, to digitization, noise, blur, and other degrading factors (see the discussion in Reiss, 1993). Structural features are difficult to model and to extract from the data in a stable fashion. Indeed, it may be more difficult to recognize a “hole” than to recognize an “8.” (Similar doubts about hand-crafted features and distinguished points are expressed in Jung & Nagy, 1995.) In addition, if one could recognize the components of objects without recognizing the objects themselves, then the choice of classifier would likely be secondary.
Our features are not invariant. However, they are semi-invariant in an appropriate sense and might be regarded as coarse substitutes for some of the true geometric, point-based invariants in the literature already cited. In this sense, we share at least the outlook expressed in recent, model-based work on quasi-invariants (Binford & Levitt, 1993; Burns, Weiss, & Riseman, 1993), where strict invariance is relaxed; however, the functionals we compute are entirely different.

The invariance properties of the queries are related to the partial ordering and the manner in which they are selected during recursive partitioning. Roughly speaking, the complexity of the queries is proportional to the depth in the tree, that is, to the number of questions asked. For elementary queries at the bottom of the ordering, we would expect that for each class c, either P(Q = 1 | Y = c) ≫ 0.5 or P(Q = 0 | Y = c) ≫ 0.5; however, this collection of elementary queries would have low discriminatory power. (These statements will be amplified later on.) Queries higher up in the ordering have much higher discriminatory power and maintain semi-invariance relative to subpopulations determined by the answers to queries preceding them in the ordering. Thus if Q̃ is a query immediately preceding Q in the ordering, then P(Q = 1 | Q̃ = 1, Y = c) ≫ 0.5 or P(Q = 0 | Q̃ = 1, Y = c) ≫ 0.5 for each class c. This will be defined more precisely in section 5 and verified empirically.

Experiments on invariant recognition are scattered throughout the article. Some involve real data: handwritten digits. Most employ synthetic data, in which case the data model involves a prototype x*_c for each shape class c ∈ C (see Figure 1) together with a space Θ of image-to-image transformations. We assume that the class label of the prototype is preserved under all transformations in Θ, namely, c = Y(θ(x*_c)) for all θ ∈ Θ, and that no two distinct prototypes can be transformed to the same image. We use “transformations” in a rather broad sense, referring both to affine maps, which alter the pose of the shapes, and to nonlinear maps, which deform the shapes. (We shall use degradation for noise, blur, etc.) Basically, Θ consists of perturbations of the identity. In particular, we are not considering the entire pose space but rather only perturbations of a reference pose, corresponding to the identity.

The probability measure P on X is derived from a probability measure ν(dθ) on the space of transformations as follows: for any D ⊂ X,

    P(D) = Σ_c P(D | Y = c) π(c) = Σ_c ν{θ : θ(x*_c) ∈ D} π(c),
where π is a prior distribution on C, which we will always take to be uniform. Thus, P is concentrated on the space of images {θ(x*_c)}_{θ,c}. Needless to say, the situation is more complex in many actual visual recognition problems, for example, in unrestricted 3D object recognition under standard projection models. Still, invariance is already challenging in the above context.
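A sketch of drawing one labeled training image from P follows: a class under the uniform prior π, then θ ~ ν. The parameter ranges are those quoted in the following paragraph; the sketch realizes only the linear part of θ (scale, rotation, skew) with scipy, and for the nonlinear part it merely samples coefficients, whose application as a deformation field is omitted. All function names are ours.

    import numpy as np
    from scipy.ndimage import affine_transform

    def sample_theta(rng):
        """theta ~ nu: log scale, rotation angle, and log skew ratio drawn
        uniformly over the ranges given in the text, plus gaussian
        coefficients of a low-frequency trigonometric series for the
        smooth deformation field (its application is omitted here)."""
        return {"log_scale": rng.uniform(-1 / 6, 1 / 6),
                "rot_deg": rng.uniform(-10, 10),
                "log_skew": rng.uniform(-1 / 3, 1 / 3),
                "deform_coeffs": rng.normal(size=(2, 3, 3))}

    def apply_linear(img, theta):
        """Linear part of theta, applied about the image center."""
        s, a = np.exp(theta["log_scale"]), np.deg2rad(theta["rot_deg"])
        r = np.exp(theta["log_skew"] / 2)      # axis ratio = exp(log_skew)
        R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
        A = R @ np.diag([s * r, s / r])
        c = (np.array(img.shape) - 1) / 2.0
        Ainv = np.linalg.inv(A)
        return affine_transform(img.astype(float), Ainv,
                                offset=c - Ainv @ c, order=1)

    def sample_image(prototypes, rng):
        """(x, y) ~ P: class c under the uniform prior, then theta(x*_c)."""
        c = int(rng.integers(len(prototypes)))
        return apply_linear(prototypes[c], sample_theta(rng)), c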
It is important to emphasize that this model is not used explicitly in the classification algorithm. Knowledge of the prototypes is not assumed, nor is θ estimated as in template approaches. The purpose of the model is to generate samples for training and testing. The images in Figure 2 were made by random sampling from a particular distribution ν on a space Θ containing both linear (scale, rotation, skew) and nonlinear transformations. Specifically, the log scale is drawn uniformly between −1/6 and 1/6; the rotation angle is drawn uniformly from ±10 degrees; and the log ratio of the axes in the skew is drawn uniformly from −1/3 to +1/3. The nonlinear part is a smooth, random deformation field constructed by creating independent, random horizontal and vertical displacements, each of which is generated by a random trigonometric series with only low-frequency terms and gaussian coefficients. All images are 32 × 32, but the actual size of the object in the image varies significantly, both from symbol to symbol and within symbol classes, due to random scaling.

3 Shape Queries

We first illustrate a shape query in the context of curves and tangents in an idealized, continuum setting. The example is purely motivational. In practice we are not dealing with one-dimensional curves in the continuum but rather with a finite pixel lattice, strokes of variable width, corrupted data, and so forth. The types of queries we use are described in sections 3.1 and 3.2.

Observe the three versions of the digit “3” in Figure 3 (left); they are obtained by spline interpolation of the center points of the segments shown in Figure 3 (middle) in such a way that the segments represent the direction of the tangent at those points. All three segment arrangements satisfy the geometric relations indicated in Figure 3 (right): there is a vertical tangent northeast of a horizontal tangent, which is south of another horizontal tangent, and so forth. The directional relations between the points are satisfied to within rather coarse tolerances. Not all curves of a “3” contain five points whose tangents satisfy all these relations. Put differently, some “3”s answer “no” to the query, “Is there a vertical tangent northeast of a . . . ?” However, rather substantial transformations of each of the versions below will answer “yes.” Moreover, among “3”s that answer “no,” it is possible to choose a small number of alternative arrangements in such a way that the entire space of “3”s is covered.

3.1 Tags. We employ primitive local features called tags, which provide a coarse description of the local topography of the intensity surface in the neighborhood of a pixel. Instead of trying to manually characterize local configurations of interest—for example, trying to define local operators to identify gradients in the various directions—we adopt an information-theoretic approach and “code” a microworld of subimages by a process very similar to tree-structured vector quantization. In this way we sidestep the
Figure 3: (Left) Three curves corresponding to the digit “3.” (Middle) Three tangent configurations determining these shapes via spline interpolation. (Right) Graphical description of relations between locations of derivatives consistent with all three configurations.
issues of boundary detection and gradients in the discrete world and allow for other forms of local topographies. This approach has been extended to gray-level images in Jedynak and Fleuret (1996).

The basic idea is to reassign symbolic values to each pixel based on examining a few pixels in its immediate vicinity; the symbolic values are the tag types and represent labels for the local topography. The neighborhood we choose is the 4 × 4 subimage containing the pixel at the upper left corner. We cluster the subimages with binary splits corresponding to adaptively choosing the five most informative locations of the sixteen sites of the subimage. Note that the size of the subimages used must depend on the resolution at which the shapes are imaged. The 4 × 4 subimages are appropriate for a certain range of resolutions—roughly 10 × 10 through 70 × 70 in our experience. The size must be adjusted for higher-resolution data, and the ultimate performance of the classifier will suffer if the resolution of the test data is not approximately the same as that of the training data. The best approach would be one that is multiresolution, something we have not done in this article (except for some preliminary experiments in section 11) but which is carried out in Jedynak and Fleuret (1996) in the context of gray-level images and 3D objects.

A large sample of 4 × 4 subimages is randomly extracted from the training data. The corresponding shape classes are irrelevant and are not retained. The reason is that the purpose of the sample is to provide a representative database of microimages and to discover the biases at that scale; the statistics of that world is largely independent of global image attributes, such as symbolic labels. This family of subimages is then recursively partitioned with binary splits. There are 4 × 4 = 16 possible questions: “Is site (i, j) black?” for i, j = 1, 2, 3, 4. The criterion for choosing a question at a node
t is dividing the subimages Ut at the node as equally as possible into two groups. This corresponds to reducing as much as possible the entropy of the empirical distribution on the 2^16 possible binary configurations for the sample Ut. There is a tag type for each node of the resulting tree, except for the root. Thus, if three questions are asked, there are 2 + 4 + 8 = 14 tags, and if five questions are asked, there are 62 tags. Depth 5 tags correspond to a more detailed description of the local topography than depth 3 tags, although eleven of the sixteen pixels still remain unexamined. Observe also that tags corresponding to internal nodes of the tree represent unions of those associated with deeper ones. At each pixel, we assign all the tags encountered by the corresponding 4 × 4 subimage as it proceeds down the tree. Unless otherwise stated, all experiments below use 62 tags.

At the first level, every site splits the population with nearly the same frequencies. However, at the second level, some sites are more informative than others, and by levels 4 and 5, there is usually one site that partitions the remaining subpopulation much better than all others. In this way, the world of microimages is efficiently coded. For efficiency, the population is restricted to subimages containing at least one black and one white site within the center four, which then obviously concentrates the processing in the neighborhood of boundaries. In the gray-level context it is also useful to consider more general tags, allowing, for example, for variations on the concept of local homogeneity.

The first three levels of the tree are shown in Figure 4, together with the most common configuration found at each of the eight level 3 nodes. Notice that the level 1 tag alone (i.e., the first bit in the code) determines the original image, so this “transform” is invertible and redundant. In Figure 5 we show all the two-bit tags and three-bit tags appearing in an image.

3.2 Tag Arrangements. The queries involve geometric arrangements of the tags. A query QA asks whether a specific geometric arrangement A of tags of certain types is present (QA(x) = 1) or is not present (QA(x) = 0) in the image. Figure 6 shows several LATEX symbols that contain a specific geometric arrangement of tags: tag 16 northeast of tag 53, which is northwest of tag 19. Notice that there are no fixed locations in this description, whereas the tags in any specific image do carry locations. “Present in the image” means there is at least one set of tags in x of the prescribed types whose locations satisfy the indicated relationships. In Figure 6, notice, for example, how different instances of the digit “0” still contain the arrangement.

Tag 16 is a depth 4 tag; the corresponding four questions in the subimage are indicated by the following mask:

    n n n 1
    0 n n n
    n 0 0 n
    n n n n
1556
Yali Amit and Donald Geman
Figure 4: First three tag levels with most common configurations.
Figure 5: (Top) All instances of the four two-bit tags. (Bottom) All instances of the eight three-bit tags.
where 0 corresponds to background, 1 to object, and n to “not asked.” These neighborhoods are loosely described by “background to lower left, object to upper right.” Similar interpretations can be made for tags 53 and 19. Restricted to the first ten symbol classes (the ten digits), the conditional distribution P(Y = c|QA = 1) on classes given the existence of this arrangement in the image is given in Table 1. Already this simple query contains significant information about shape.
Figure 6: (Top) Instances of a geometric arrangement in several “0”s. (Bottom) Several instances of the geometric arrangement in one “6.”

Table 1: Conditional Distribution on Digit Classes Given the Arrangement of Figure 6.

    Digit:        0     1      2     3     4     5     6     7    8     9
    Probability:  .13   .003   .03   .08   .04   .07   .23   0    .26   .16
To complete the construction of the feature set, we need to define a set of allowable relationships among image locations. These are binary functions of pairs, triples, and so forth of planar points, which depend on only their relative coordinates. An arrangement A is then a labeled (hyper)graph. Each vertex is labeled with a type of tag, and each edge (or superedge) is labeled with a type of relation. The graph in Figure 6, for example, has only binary relations. In fact, all the experiments on the LATEX symbols are restricted to this setting. The experiments on handwritten digits also use a ternary relationship of the metric type.

There are eight binary relations between any two locations u and v, corresponding to the eight compass headings (north, northeast, east, etc.). For example, u is “north” of v if the angle of the vector u − v is between π/4 and 3π/4. More generally, the two points satisfy relation k (k = 1, . . . , 8) if the angle of the vector u − v is within π/4 of kπ/4. Let A denote the set of all possible arrangements, and let Q = {QA : A ∈ A}, our feature set.

There are many other binary and ternary relations that have discriminating power. For example, there is an entire family of “metric” relationships that are, like the directional relationships above, completely scale and translation invariant. Given points u, v, w, z, one example of a ternary relation is ‖u − v‖ < ‖u − w‖, which inquires whether u is closer to v than to w. With four points we might ask if ‖u − v‖ < ‖w − z‖.
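A sketch of the directional relations and of the disjunctive query evaluation follows. Coordinates are taken here as (x, y) with the vertical axis pointing up, so that k = 2 is “north”; the paper does not state its coordinate convention, and the brute-force search over tag placements is adequate only for small arrangements.

    import numpy as np
    from itertools import product

    def relation(u, v, k):
        """True if the angle of u - v is within pi/4 of k * pi/4 (k = 1..8)."""
        ang = np.arctan2(u[1] - v[1], u[0] - v[0])
        diff = (ang - k * np.pi / 4 + np.pi) % (2 * np.pi) - np.pi
        return abs(diff) < np.pi / 4

    def query(tag_locations, vertices, edges):
        """Q_A(x): 1 if some placement of arrangement A is present.
        vertices: list of tag types; edges: (i, j, k) directional
        constraints; tag_locations[t]: locations of tag type t in the
        image.  The outer loop is the extensive ORing described above."""
        pools = [tag_locations.get(t, []) for t in vertices]
        for placement in product(*pools):
            pts = [np.asarray(p, dtype=float) for p in placement]
            if all(relation(pts[i], pts[j], k) for i, j, k in edges):
                return 1
        return 0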
4 The Posterior Distribution and Tree-Based Approximations

For simplicity, and in order to facilitate comparisons with other methods, we restrict ourselves to queries QA of bounded complexity. For example, consider arrangements A with at most twenty tags and twenty relations; this limit is never exceeded in any of the experiments. Enumerating these arrangements in some fashion, let Q = (Q1, . . . , QM) be the corresponding feature vector assuming values in {0, 1}^M. Each image x then generates a bit string of length M, which contains all the information available for estimating Y(x). Of course, M is enormous. Nonetheless, it is not evident how we might determine a priori which features are informative and thereby reduce M to manageable size.

Evidently these bit strings partition X. Two images that generate the same bit string or “atom” need not be identical. Indeed, due to the invariance properties of the queries, the two corresponding symbols may vary considerably in scale, location, and skew and are not even affine equivalent in general. Nonetheless, two such images will have very similar shapes. As a result, it is reasonable to expect that H(Y|Q) (the conditional entropy of Y given Q) is very small, in which case we can in principle obtain high classification rates using Q. To simplify things further, at least conceptually, we will assume that H(Y|Q) = 0; this is not an unreasonable assumption for large M. An equivalent assumption is that the shape class Y is determined by Q and that the error rate of the Bayes classifier

    Ŷ_B = arg max_c P(Y = c | Q)
is zero. Needless to say, perfect classification cannot actually be realized. Due to the size of M, the full posterior cannot be computed, and the classifier Ŷ_B is only hypothetical.

Suppose we examine some of the features by constructing a single binary tree T based on entropy-driven recursive partitioning and randomization and that T is uniformly of depth D, so that D of the M features are examined for each image x. (The exact procedure is described in the following section; the details are not important for the moment.) Suffice it to say that a feature Q_m is assigned to each interior node of T, and the set of features Q_{π_1}, . . . , Q_{π_D} along each branch from root to leaf is chosen sequentially and based on the current information content given the observed values of the previously chosen features. The classifier based on T is then

    Ŷ_T = arg max_c P(Y = c|T) = arg max_c P(Y = c|Q_{π_1}, . . . , Q_{π_D}).

Since D ≪ M, Ŷ_T is not the Bayes classifier. However, even for values of D on the order of hundreds or thousands, we can expect that P(Y = c|T) ≈ P(Y = c|Q). We shall refer to the difference between these distributions (in some appropriate norm) as the approximation error (AE). This is one of the sources of error in replacing Q by a subset of features. Of course, we cannot actually compute a tree of such depth since at least several hundred features are needed to achieve good classification; we shall return to this point shortly.

Regardless of the depth D, in reality we do not actually know the posterior distribution P(Y = c|T). Rather, it must be estimated from a training set L = {(x_1, Y(x_1)), . . . , (x_m, Y(x_m))}, where x_1, . . . , x_m is a random sample from P. (The training set is also used to estimate the entropy values during recursive partitioning.) Let P̂_L(Y = c|T) denote the estimated distribution, obtained by simply counting the number of training images of each class c that land at each terminal node of T. If L is sufficiently large, then P̂_L(Y = c|T) ≈ P(Y = c|T). We call the difference estimation error (EE), which of course vanishes only as |L| → ∞.

The purpose of multiple trees (see section 6) is to solve the approximation error problem and the estimation error problem at the same time. Even if we could compute and store a very deep tree, there would still be too many probabilities (specifically K · 2^D) to estimate with a practical training set L. Our approach is to build multiple trees T_1, . . . , T_N of modest depth. In this way tree construction is practical, and P̂_L(Y = c|T_n) ≈ P(Y = c|T_n), n = 1, . . . , N. Moreover, the total number of features examined is sufficiently large to control the approximation error. The classifier we propose is

    Ŷ_S = arg max_c (1/N) Σ_{n=1}^{N} P̂_L(Y = c|T_n).
An explanation for this particular way of aggregating the information from multiple trees is provided in section 6.1. In principle, a better way to combine the trees would be to classify based on the mode of P(Y = c|T_1, . . . , T_N).
However, this is impractical for reasonably sized training sets for the same reasons that a single deep tree is impractical (see section 6.4 for some numerical experiments). The trade-off between AE and EE is related to the trade-off between bias and variance, which is discussed in section 6.2, and the relative error rates among all these classifiers are analyzed in more detail in section 6.4 in the context of parameter estimation.

5 Tree-Structured Shape Quantization

Standard decision tree construction (Breiman et al., 1984; Quinlan, 1986) is based on a scalar-valued feature or attribute vector z = (z_1, . . . , z_k), where k is generally about 10–100. Of course, in pattern recognition, the raw data are images, and finding the right attributes is widely regarded as the main issue. Standard splitting rules are based on functions of this vector, usually involving a single component z_j (e.g., applying a threshold) but occasionally involving multivariate functions or "transgenerated features" (Friedman, 1973; Gelfand & Delp, 1991; Guo & Gelfand, 1992; Sethi, 1991). In our case, the queries {Q_A} are the candidates for splitting rules. We now describe the manner in which the queries are used to construct a tree.

5.1 Exploring Shape Space

Since the set of queries Q is indexed by graphs, there is a natural partial ordering under which a graph precedes any of its extensions. The partial ordering corresponds to a hierarchy of structure. Small arrangements with few tags produce coarse splits of shape space. As the arrangements increase in size (say, in the number of tags plus relations), they contain more and more information about the images that contain them. However, fewer and fewer images contain such an instance—that is, P(Q = 1) ≈ 0 for a query Q based on a complex arrangement.

One straightforward way to exploit this hierarchy is to build a decision tree using the collection Q as candidates for splitting rules, with the complexity of the queries increasing with tree depth (distance from the root). In order to begin to make this computationally feasible, we define a minimal extension of an arrangement A to mean the addition of exactly one relation between existing tags, or the addition of exactly one tag and one relation binding the new tag to an existing one. By a binary arrangement, we mean one with two tags and one relation; the collection of associated queries is denoted B ⊂ Q.

Now build a tree as follows. At the root, search through B and choose the query Q ∈ B that leads to the greatest reduction in the mean uncertainty about Y given Q. This is the standard criterion for recursive partitioning in machine learning and other fields. Denote the chosen query Q_{A_0}. Those data points for which Q_{A_0} = 0 are in the "no" child node, and we search again through B. Those data points for which Q_{A_0} = 1 are in the "yes" child node and have one or more instances of A_0, the "pending arrangement." Now search among minimal extensions of A_0 and choose the one that leads to the greatest reduction in uncertainty about Y given the existence of A_0.
Figure 7: Examples of node splitting. All six images lie in the same node and have a pending arrangement with three vertices. The “0”s are separated from the “3”s and “5”s by asking for the presence of a new tag, and then the “3”s and “5”s are separated by asking a question about the relative angle between two existing vertices. The particular tags associated with these vertices are not indicated.
The digits in Figure 6 were taken from a depth 2 ("yes"/"yes") node of such a tree. We measure uncertainty by Shannon entropy. The expected uncertainty in Y given a random variable Z is

    H(Y|Z) = − Σ_z P(Z = z) Σ_c P(Y = c|Z = z) log_2 P(Y = c|Z = z).
Define H(Y|Z, B) for an event B ⊂ X in the same way, except that P is replaced by the conditional probability measure P(·|B). Given we are at a node t of depth k > 0 in the tree, let the "history" be B_t = {Q_{A_0} = q_0, . . . , Q_{A_{k−1}} = q_{k−1}}, meaning that Q_{A_1} is the second query chosen given that q_0 ∈ {0, 1} is the answer to the first; Q_{A_2} is the third query chosen given the answers to the first two are q_0 and q_1; and so forth. The pending arrangement, say A_j, is the deepest arrangement along the path from root to t for which q_j = 1, so that q_i = 0, i = j + 1, . . . , k − 1. Then Q_{A_k} minimizes H(Y|Q_A, B_t) among minimal extensions of A_j. An example of node splitting is shown in Figure 7.

Continue in this fashion until a stopping criterion is satisfied, for example, the number of data points at every terminal node falls below a threshold. Each tree may then be regarded as a discrete random variable T on X; each terminal node corresponds to a different value of T.

In practice, we cannot compute these expected entropies; we can only estimate them from a training set L. Then P is replaced by the empirical distribution P̂_L on {x_1, . . . , x_m} in computing the entropy values.

5.2 Randomization

Despite the growth restrictions, the procedure above is still not practical; the number of binary arrangements is very large, and there are too many minimal extensions of more complex arrangements. In addition, if more than one tree is made, even with a fresh sample of data points per tree, there might be very little difference among the trees. The solution is simple: instead of searching among all the admissible queries at each node, we restrict the search to a small random subset.
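The following is a minimal sketch of this randomized, entropy-driven node splitting (our own function names; a query is any callable mapping an image to {0, 1}, and entropies are computed from the empirical distribution of the data at the node, as in the text):

    import math, random
    from collections import Counter

    def cond_entropy(labels, answers):
        # Empirical H(Y | Q): weight the class entropy of each answer
        # group ("no"/"yes") by the fraction of points in that group.
        n = len(labels)
        h = 0.0
        for a in (0, 1):
            group = [y for y, q in zip(labels, answers) if q == a]
            if group:
                freqs = Counter(group).values()
                h_a = -sum(c / len(group) * math.log2(c / len(group))
                           for c in freqs)
                h += (len(group) / n) * h_a
        return h

    def choose_query(admissible, data, labels, subset_size=20):
        # Score only a small random subset of the admissible queries
        # and keep the one minimizing the estimated H(Y | Q, B_t).
        sample = random.sample(admissible, min(subset_size, len(admissible)))
        return min(sample,
                   key=lambda q: cond_entropy(labels, [q(x) for x in data]))

Restricting the search to a random subset is also what makes the trees in a family differ from one another, a point taken up in section 6.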
Figure 8: Samples from data sets. (Top) Spot noise. (Middle) Duplication. (Bottom) Severe perturbations.
5.3 A Structural Description

Notice that only connected arrangements can be selected, meaning every two tags are neighbors (participate in a relation) or are connected by a sequence of neighboring tags. As a result, training is more complex than standard recursive partitioning. At each node, a list must be assigned to each data point consisting of all instances of the pending arrangement, including the coordinates of each participating tag. If a data point passes to the "yes" child, then only those instances that can be incremented are maintained and updated; the rest are deleted. The more data points there are, the more bookkeeping.

A far simpler possibility is sampling exclusively from B, the binary arrangements (i.e., two vertices and one relation), listed in some order. In fact, we can imagine evaluating all the queries in B for each data point. This vector could then be used with a variety of standard classifiers, including decision trees built in the standard fashion. In the latter case, the pending arrangements are unions of binary graphs, each one disconnected from all the others. This approach is much simpler and faster to implement and preserves the semi-invariance. However, the price is dear: losing the common, global characterization of shape in terms of a large, connected graph. Here we are referring to the pending arrangements at the terminal nodes (except at the end of the all-"no" branch); by definition, this graph is found in all the shapes at the node. This is what we mean by a structural description.

The difference between one connected graph and a union of binary graphs can be illustrated as follows. Relative to the entire population X, a random selection in B is quite likely to carry some information about Y, measured, say, by the mutual information I(Y, Q) = H(Y) − H(Y|Q). On the other hand, a random choice among all queries with, say, five tags will most likely have no information because nearly all data points x will answer "no." In other words, it makes sense at least to start with binary arrangements. Assume, however, that we are restricted to a subset {Q_A = 1} ⊂ X determined by an arrangement A of moderate complexity. (In general, the subsets at the nodes are determined by the "no" answers as well as the "yes" answers, but the situation is virtually the same.) On this small subset, a randomly sampled binary arrangement will be less likely to yield a significant drop in uncertainty than a randomly sampled query among minimal extensions of A. These observations have been verified experimentally, and we omit the details.

This distinction becomes more pronounced if the images are noisy (see the top panel of Figure 8) or contain structured backgrounds (see the bottom panel of Figure 11) because there will be many false positives for arrangements with only two tags. However, the chance of finding complex arrangements utilizing noise tags or background tags is much smaller. Put differently, a structural description is more robust than a list of attributes. The situation is the same for more complex shapes; see, for example, the middle panel of Figure 8, where the shapes were created by duplicating each symbol four times with some shifts. Again, a random choice among minimal extensions carries much more information than a random choice in B.

5.4 Semi-Invariance

Another benefit of the structural description is what we refer to as semi-invariance. Given a node t, let B_t be the history and A_j the pending arrangement. For any minimal extension A of A_j, and for any shape class c, we want

    max(P(Q_A = 0|Y = c, B_t), P(Q_A = 1|Y = c, B_t)) ≫ .5.

In other words, most of the images in B_t of the same class should answer the same way to query Q_A. In terms of entropy, semi-invariance is equivalent to relatively small values of H(Q_A|Y = c, B_t) for all c. Averaging over classes, this in turn is equivalent to small values of H(Q_A|Y, B_t) at each node t.

In order to verify this property, we created ten trees of depth 5 using the data set described in section 2 with thirty-two samples per symbol class.
At each nonterminal node t of each tree, the average value of H(Q_A|Y, B_t) was calculated over twenty randomly sampled minimal extensions. Over all nodes, the mean entropy was m = .33; this is the entropy of the distribution (.06, .94). The standard deviation over all nodes and queries was σ = .08. Moreover, there was a clear decrease in average entropy (i.e., an increase in the degree of invariance) as the depth of the node increased.

We also estimated the entropy for more severe deformations. On a more variable data set with approximately double the range of rotations, log scale, and log skew (relative to the values in section 2), and the same nonlinear deformations, the corresponding numbers were m = .38, σ = .09. Finally, for rotations sampled from (−30, 30) degrees, log scale from (−.5, .5), log skew from (−1, 1), and doubling the variance of the random nonlinear deformation (see the bottom panel of Figure 8), the corresponding mean entropy was m = .44 (σ = .11), corresponding to a (.1, .9) split. In other words, on average, 90 percent of the images in the same shape class still answer the same way to a new query.

Notice that the invariance property is independent of the discriminating power of the query, that is, the extent to which the distribution P(Y = c|B_t, Q_A) is more peaked than the distribution P(Y = c|B_t). Due to the symmetry of mutual information,

    H(Y|B_t) − H(Y|Q_A, B_t) = H(Q_A|B_t) − H(Q_A|Y, B_t).

This means that if we seek a question that maximizes the reduction in the conditional entropy of Y and assume the second term on the right is small due to semi-invariance, then we need only find a query that maximizes H(Q_A|B_t). This, however, does not involve the class variable and hence points to the possibility of unsupervised learning, which is discussed in the following section.

5.5 Unsupervised Learning

We outline two ways to construct trees in an unsupervised mode, that is, without using the class labels Y(x_j) of the samples x_j in L. Clearly each query Q_m decreases uncertainty about Q, and hence about Y. Indeed, H(Y|Q_m) ≤ H(Q|Q_m) since we are assuming Y is determined by Q. More generally, if T is a tree based on some of the components of Q and if H(Q|T) ≪ H(Q), then T should contain considerable information about the shape class.

Recall that in the supervised mode, the query Q_m chosen at node t minimizes H(Y|B_t, Q_m) (among a random sample of admissible queries), where B_t is the event in X corresponding to the answers to the previous queries. Notice that typically this is not equivalent to simply maximizing the information content of Q_m because H(Y|B_t, Q_m) = H(Y, Q_m|B_t) − H(Q_m|B_t), and both terms depend on m. However, in light of the discussion in the preceding section about semi-invariance, the first term can be ignored, and we can focus on maximizing the second term.
Another way to motivate this criterion is to replace Y by Q, in which case H(Q|B_t, Q_m) = H(Q, Q_m|B_t) − H(Q_m|B_t) = H(Q|B_t) − H(Q_m|B_t). Since the first term is independent of m, the query of choice will again be the one maximizing H(Q_m|B_t). Recall that the entropy values are estimated from training data and that Q_m is binary. It follows that growing a tree aimed at reducing uncertainty about Q is equivalent to finding at each node that query which best splits the data at the node into two equal parts. This results from the fact that maximizing H(p) = −p log_2(p) − (1 − p) log_2(1 − p) reduces to minimizing |p − .5|.

In this way we generate shape quantiles or clusters ignoring the class labels. Still, the tree variable T is highly correlated with the class variable Y. This would be the case even if the tree were grown from samples representing only some of the shape classes. In other words, these clustering trees produce a generic quantization of shape space. In fact, the same trees can be used to classify new shapes (see section 9).

We have experimented with such trees, using the splitting criterion described above as well as another unsupervised one based on the "question metric,"

    d_Q(x, x′) = (1/M) Σ_{m=1}^{M} δ(Q_m(x) ≠ Q_m(x′)),  x, x′ ∈ X,

where δ(S) = 1 if the statement S is true and δ(S) = 0 otherwise.
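Both unsupervised criteria are straightforward to compute; a minimal sketch (our own function names; code strings are 0/1 sequences):

    def question_metric(qx, qy):
        # d_Q(x, x'): fraction of queries on which the two binary
        # code strings disagree (a normalized Hamming distance).
        return sum(a != b for a, b in zip(qx, qy)) / len(qx)

    def balance_score(answers):
        # |p - .5| for a binary query at a node; minimizing this is
        # equivalent to maximizing the empirical entropy H(Q_m | B_t).
        p = sum(answers) / len(answers)
        return abs(p - 0.5)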
Since Q determines Y, it makes sense to divide the data so that each child is as homogeneous as possible with respect to d_Q; we omit the details. Both clustering methods lead to classification rates that are inferior to those obtained with splits determined by separating classes but still surprisingly high; one such experiment is reported in section 6.1.

6 Multiple Trees

We have seen that small, random subsets of the admissible queries at any node invariably contain at least one query that is informative about the shape class. What happens if many such trees are constructed using the same training set L? Because the family Q of queries is so large and because different queries—tag arrangements—address different aspects of shape, separate trees should provide separate structural descriptions, characterizing the shapes from different "points of view." This is illustrated in Figure 9, where the same image is shown with an instance of the pending graph at the terminal node in five different trees. Hence, aggregating the information provided by a family of trees (see section 6.1) should yield more accurate and more robust classification. This will be demonstrated in experiments throughout the remainder of the article.
Figure 9: Graphs found in an image at terminal nodes of five different trees.
Generating multiple trees by randomization was proposed in Geman, Amit, & Wilder (1996). Previously, other authors had advanced other methods for generating multiple trees. One of the earliest was weighted voting trees (Casey & Jih, 1983); Shlien (1990) uses different splitting criteria; Breiman (1994) uses bootstrap replicates of L; and Dietterich and Bakiri (1995) introduce the novel idea of replacing the multiclass learning problem by a family of two-class problems, dedicating a tree to each of these. Most of these articles deal with fixed-size feature vectors and coordinate-based questions. All authors report gains in accuracy and stability.

6.1 Aggregation

Suppose we are given a family of trees T_1, . . . , T_N. The best classifier based on these is

    Ŷ_A = arg max_c P(Y = c|T_1, . . . , T_N),

but this is not feasible (see section 6.4). Another option would be to regard the trees as high-dimensional inputs to standard classifiers. We tried that with classification trees, linear and nonlinear discriminant analysis, K-means clustering, and nearest neighbors, all without improvement over simple averaging for the amount of training data we used.

By averaging, we mean the following. Let μ_{n,τ}(c) denote the posterior distribution P(Y = c|T_n = τ), n = 1, . . . , N, c = 1, . . . , K, where τ denotes a terminal node. We write μ_{T_n} for the random variable μ_{n,T_n}. These probabilities are the parameters of the system, and the problem of estimating them will be discussed in section 6.4. Define

    μ̄(x) = (1/N) Σ_{n=1}^{N} μ_{T_n}(x),

the arithmetic average of the distributions at the leaves reached by x. The mode of μ̄(x) is the class assigned to the data point x; that is,

    Ŷ_S = arg max_c μ̄(c).
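A minimal sketch of this aggregation (our own names; each tree is assumed to expose a hypothetical leaf_distribution(x) method returning the estimated distribution {class: P̂_L(Y = c|T_n = leaf reached by x)} as a dict):

    def aggregate_classify(trees, x):
        # Drop x down each tree, average the terminal class
        # distributions, and return the mode of the average
        # together with the aggregate distribution mu_bar itself.
        mu_bar = {}
        for t in trees:
            for c, p in t.leaf_distribution(x).items():
                mu_bar[c] = mu_bar.get(c, 0.0) + p / len(trees)
        return max(mu_bar, key=mu_bar.get), mu_bar

The aggregate distribution μ̄ is reused below for fast indexing (section 10) and for rejection (section 11).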
Using a training database of thirty-two samples per symbol from the distribution described in section 2, we grew N = 100 trees of average depth d = 10 and tested the performance on a test set of five samples per symbol. The classification rate was 96 percent. This experiment was repeated several times with very similar results. On the other hand, growing one hundred unsupervised trees of average depth 11 and using the labeled data only to estimate the terminal distributions, we achieved a classification rate of 94.5 percent.

6.2 Dependence on the Training Set

The performance of classifiers constructed from training samples can be adversely affected by overdependence on the particular sample. One way to measure this is to consider the population of all training sets L of a particular size and to compute, for each data point x, the average E_L[e_L(x)], where e_L denotes the error at x for the classifier made with L. (These averages may then be further averaged over X.) The average error decomposes into two terms, one corresponding to bias and the other to variance (Geman, Bienenstock, & Doursat, 1992). Roughly speaking, the bias term captures the systematic errors of the classifier design, and the variance term measures the error component due to random fluctuations from one training set L to another. Generally parsimonious designs (e.g., those based on relatively few unknown parameters) yield low variance but highly biased decision boundaries, whereas complex nonparametric classifiers (e.g., neural networks with many parameters) suffer from high variance, at least without enormous training sets. Good generalization requires striking a balance. (See Geman et al., 1992, for a comprehensive treatment of the bias/variance dilemma; see also the discussions in Breiman, 1994; Kong & Dietterich, 1995; and Raudys & Jain, 1991.)

One simple experiment was carried out to measure the dependence of our classifier Ŷ_S on the training sample; we did not systematically explore the decomposition mentioned above. We made ten sets of twenty trees from ten different training sets, each consisting of thirty-two samples per symbol. The average classification rate was 85.3 percent; the standard deviation was 0.8 percent. Table 2 shows the number of images in the test set correctly labeled by j of the classifiers, j = 0, 1, . . . , 10. For example, we see that 88 percent of the test points are correctly labeled at least six out of ten times. Taking the plurality of the ten classifiers improves the classification rate to 95.5 percent, so there is some pointwise variability among the classifiers. However, the decision boundaries and overall performance are fairly stable with respect to L.

We attribute the relatively small variance component to the aggregation of many weakly dependent trees, which in turn results from randomization. The bias issue is more complex, and we have definitely noticed certain types of structural errors in our experiments with handwritten digits from the NIST database; for example, certain styles of writing are systematically misclassified despite the randomization effects.
Table 2: Number of Points as a Function of the Number of Correct Classifiers.

Number of correct classifiers    0    1    2    3    4    5    6    7     8     9    10
Number of points                 9   11   20   29   58   42   59   88   149   237   763
6.3 Relative Error Rates

Due to estimation error, we favor many trees of modest depth over a few deep ones, even at the expense of theoretically higher error rates where perfect estimation is possible. In this section, we analyze those error rates for some of the alternative classifiers discussed above in the asymptotic case of infinite data, assuming the total number of features examined is held fixed, presumably large enough to guarantee low approximation error. The implications for finite data are outlined in section 6.4.

Instead of making N trees T_1, . . . , T_N of depth D, suppose we made just one tree T* of depth ND; in both cases we are asking ND questions. Of course this is not practical for the values of D and N mentioned above (e.g., D = 10, N = 20), but it is still illuminating to compare the hypothetical performance of the two methods. Suppose further that the criterion for selecting T* is to minimize the error rate over all trees of depth ND:

    T* = arg max_T E[max_c P(Y = c|T)],

where the maximum is over all trees of depth ND. The error rate of the corresponding classifier Ŷ* = arg max_c P(Y = c|T*) is then e(Ŷ*) = 1 − E[max_c P(Y = c|T*)]. Notice that finding T* would require the solution of a global optimization problem that is generally intractable, accounting for the nearly universal adoption of greedy tree-growing algorithms based on entropy reduction, such as the one we are using. Notice also that minimizing the entropy H(Y|T) or the error rate P(Y ≠ Ŷ(T)) amounts to basically the same thing.

Let e(Ŷ_A) and e(Ŷ_S) be the error rates of Ŷ_A and Ŷ_S (defined in section 6.1), respectively. Then it is easy to show that e(Ŷ*) ≤ e(Ŷ_A) ≤ e(Ŷ_S). The first inequality results from the observation that the N trees of depth D could be combined into one tree of depth ND simply by grafting T_2 onto each terminal node of T_1, then grafting T_3 onto each terminal node of the new tree, and so forth. The error rate of the tree so constructed is just e(Ŷ_A). However, the error rate of T* is minimal among all trees of depth ND and hence is no greater than e(Ŷ_A). Since Ŷ_S is a function of T_1, . . . , T_N, the second inequality follows from a standard argument:
    P(Y ≠ Ŷ_S) = E[P(Y ≠ Ŷ_S|T_1, . . . , T_N)]
               ≥ E[P(Y ≠ arg max_c P(Y = c|T_1, . . . , T_N)|T_1, . . . , T_N)]
               = P(Y ≠ Ŷ_A).
6.4 Parameter Estimation

In terms of tree depth, the limiting factor is parameter estimation, not computation or storage. The probabilities P(Y = c|T*), P(Y = c|T_1, . . . , T_N), and P(Y = c|T_n) are unknown and must be estimated from training data. In each of the cases Ŷ* and Ŷ_A, there are K × 2^{ND} parameters to estimate (recall K is the number of shape classes), whereas for Ŷ_S there are K × N × 2^D parameters. Moreover, the number of data points in L available per parameter is |L|/(K · 2^{ND}) in the first two cases and |L|/(K · 2^D) with aggregation.

For example, consider the family of N = 100 trees described in section 6.1, which were used to classify the K = 293 LATEX symbols. Since the average depth is D = 8, there are approximately 100 × 2^8 × 293 ≈ 7.5 × 10^6 parameters, although most of these are nearly zero. Indeed, in all experiments reported below, only the largest five elements of μ_{n,τ} are estimated; the rest are set to zero. It should be emphasized, however, that the parameter estimates can be refined indefinitely using additional samples from X, a form of incremental learning (see section 9).

For Ŷ_A = arg max_c P(Y = c|T_1, . . . , T_N), the estimation problem is overwhelming, at least without assuming conditional independence or some other model for dependence. This was illustrated when we tried to compare the magnitudes of e(Ŷ_A) with e(Ŷ_S) in a simple case. We created N = 4 trees of depth D = 5 to classify just the first K = 10 symbols, which are the ten digits. The trees were constructed using a training set L with 1000 samples per symbol. Using Ŷ_S, the error rate on L was just under 6 percent; on a test set V of 100 samples per symbol, the error rate was 7 percent. Unfortunately, L was not large enough to estimate the full posterior given the four trees. Consequently, we tried using 1000, 2000, 4000, 10,000, and 20,000 samples per symbol for estimation. With two trees, the error rate was consistent from L to V, even with 2000 samples per symbol, and it was slightly lower than e(Ŷ_S). With three trees, there was a significant gap between the (estimated) e(Ŷ_A) on L and V, even with 20,000 samples per symbol; the estimated value of e(Ŷ_A) on V was 6 percent compared with 8 percent for e(Ŷ_S). With four trees and using 20,000 samples per symbol, the estimate of e(Ŷ_A) on V was about 6 percent, and about 1 percent on L. It was only 1 percent better than e(Ŷ_S), which was 7 percent and required only 1000 samples per symbol. We did not go beyond 20,000 samples per symbol. Ultimately Ŷ_A will do better, but the amount of data needed to demonstrate this is prohibitive, even for four trees. Evidently the same problems would be encountered in trying to estimate the error rate for a very deep tree.
7 Performance Bounds

We divide this into two cases: individual trees and multiple trees. Most of the analysis for individual trees concerns a rather ideal case (twenty questions) in which the shape classes are atomic; there is then a natural metric on shape classes, and one can obtain bounds on the expected uncertainty after a given number of queries in terms of this metric and an initial distribution over classes. The key issue for multiple trees is weak dependence, and the analysis there is focused on the dependence structure among the trees.

7.1 Individual Trees: Twenty Questions

Suppose first that each shape class or hypothesis c is atomic, that is, it consists of a single atom of Q (as defined in section 4). In other words, each "hypothesis" c has a unique code word, which we denote by Q(c) = (Q_1(c), . . . , Q_M(c)), so that Q is determined by Y. This setting corresponds exactly to a mathematical version of the twenty questions game. There is also an initial distribution ν(c) = P(Y = c). For each m = 1, . . . , M, the binary sequence (Q_m(1), . . . , Q_m(K)) determines a subset of hypotheses—those that answer yes to query Q_m. Since the code words are distinct, asking enough questions will eventually determine Y. The mathematical problem is to find the ordering of the queries that minimizes the mean number of queries needed to determine Y, or the mean uncertainty about Y after a fixed number of queries.

The best-known example is when there is a query for every subset of {1, . . . , K}, so that M = 2^K. The optimal strategy is given by the Huffman code, in which case the mean number of queries required to determine Y lies in the interval [H(Y), H(Y) + 1) (see Cover & Thomas, 1991).

Suppose π_1, . . . , π_k represent the indices of the first k queries. The mean residual uncertainty about Y after k queries is then

    H(Y|Q_{π_1}, . . . , Q_{π_k}) = H(Y, Q_{π_1}, . . . , Q_{π_k}) − H(Q_{π_1}, . . . , Q_{π_k})
                                 = H(Y) − H(Q_{π_1}, . . . , Q_{π_k})   (since the queries are functions of Y)
                                 = H(Y) − (H(Q_{π_1}) + H(Q_{π_2}|Q_{π_1}) + · · · + H(Q_{π_k}|Q_{π_1}, . . . , Q_{π_{k−1}})).
Consequently, if at each stage there is a query that divides the active hypotheses into two groups such that the mass of the smaller group is at least β (0 < β ≤ .5), then H(Y|Q_{π_1}, . . . , Q_{π_k}) ≤ H(Y) − kH(β), where H(β) denotes the entropy of the distribution (β, 1 − β). The mean decision time is roughly H(Y)/H(β). In all unsupervised trees we produced, we found H(Q_{π_k}|Q_{π_1}, . . . , Q_{π_{k−1}}) to be greater than .99 (corresponding to β ≈ .5) at 95 percent of the nodes.

If assumptions are made about the degree of separation among the code words, one can obtain bounds on mean decision times and the expected uncertainty after a fixed number of queries, in terms of the prior distribution ν. For these types of calculations, it is easier to work with the Hellinger
measure of uncertainty than with Shannon entropy. Given a probability vector p = (p_1, . . . , p_J), define

    G(p) = Σ_{j≠i} √(p_j p_i),

and define G(Y), G(Y|B_t), and G(Y|B_t, Q_m) the same way as with the entropy function H. (G and H have similar properties; for example, G is minimized on a point mass, maximized on the uniform distribution, and it follows from Jensen's inequality that H(p) ≤ log_2[G(p) + 1].) The initial amount of uncertainty is

    G(Y) = Σ_{c≠c′} ν^{1/2}(c) ν^{1/2}(c′).

For any subset {m_1, . . . , m_k} ⊂ {1, . . . , M}, using Bayes rule and the fact that P(Q|Y) is either 0 or 1, we obtain

    G(Y|Q_{m_1}, . . . , Q_{m_k}) = Σ_{c≠c′} Π_{i=1}^{k} δ(Q_{m_i}(c) = Q_{m_i}(c′)) ν^{1/2}(c) ν^{1/2}(c′).

Now suppose we average G(Y|Q_{m_1}, . . . , Q_{m_k}) over all subsets {m_1, . . . , m_k} (allowing repetition). The average is

    M^{−k} Σ_{(m_1,...,m_k)} G(Y|Q_{m_1}, . . . , Q_{m_k})
        = Σ_{c≠c′} M^{−k} Σ_{(m_1,...,m_k)} Π_{i=1}^{k} δ(Q_{m_i}(c) = Q_{m_i}(c′)) ν^{1/2}(c) ν^{1/2}(c′)
        = Σ_{c≠c′} (1 − d_Q(c, c′))^k ν^{1/2}(c) ν^{1/2}(c′).

Consequently, any better-than-average subset of queries satisfies

    G(Y|Q_{m_1}, . . . , Q_{m_k}) ≤ Σ_{c≠c′} (1 − d_Q(c, c′))^k ν^{1/2}(c) ν^{1/2}(c′).

If γ = min_{c,c′} d_Q(c, c′), then the residual uncertainty is at most (1 − γ)^k G(Y). In order to disambiguate K hypotheses under a uniform starting distribution (in which case G(Y) = K − 1), we would need approximately

    k ≈ −log K / log(1 − γ)

queries, or k ≈ (log K)/γ for small γ. (This is clear without the general inequality above, since we eliminate a fraction γ of the remaining hypotheses with each new query.) This value of k is too large to be practical for realistic values of γ (due to storage, etc.) but does express the divide-and-conquer nature of recursive partitioning in the logarithmic dependence on the number of hypotheses.
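As a quick numeric illustration of the last formula (a sketch with hypothetical values; the text does not report a γ for this data):

    import math

    def queries_needed(K, gamma):
        # k ~ -log K / log(1 - gamma): queries needed to drive the
        # initial Hellinger uncertainty G(Y) = K - 1 below 1 when each
        # better-than-average query shrinks it by a factor (1 - gamma).
        return math.log(K) / -math.log(1.0 - gamma)

    print(round(queries_needed(293, 0.1)))  # -> 54 queries for K = 293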
Needless to say, the compound case is the only realistic one, where the number of atoms in a shape class is a measure of its complexity. (For example, we would expect many more atoms per handwritten digit class than per printed font class.) In the compound case, one can obtain results similar to those mentioned above by considering the degree of homogeneity within classes as well as the degree of separation between classes. For example, the index γ must be replaced by one based on both the maximum distance D_max between code words of the same class and the minimum distance D_min between code words from different classes. Again, the bounds obtained call for trees that are too deep actually to be made, and much deeper than those that are empirically demonstrated to obtain good discrimination. We achieve this in practice due to semi-invariance, guaranteeing that D_max is small, and the extraordinary richness of the world of spatial relationships, guaranteeing that D_min is large.

7.2 Multiple Trees: Weak Dependence

From a statistical perspective, randomization leads to weak conditional dependence among the trees. For example, given Y = c, the correlation between two trees T_1 and T_2 is small. In other words, given the class of an image, knowing the leaf of T_1 that is reached would not aid us in predicting the leaf reached in T_2.

In this section, we analyze the dependence structure among the trees and obtain a crude lower bound on the performance of the classifier Ŷ_S for a fixed family of trees T_1, . . . , T_N constructed from a fixed training set L. Thus we are not investigating the asymptotic performance of Ŷ_S as either N → ∞ or |L| → ∞. With infinite training data, a tree could be made arbitrarily deep, leading to arbitrarily high classification rates since nonparametric classifiers are generally strongly consistent.

Let E_c μ̄ = (E_c μ̄(1), . . . , E_c μ̄(K)) denote the mean of μ̄ conditioned on Y = c: E_c μ̄(d) = (1/N) Σ_{n=1}^{N} E(μ_{T_n}(d)|Y = c). We make three assumptions about the mean vector, all of which turn out to be true in practice:

1. arg max_d E_c μ̄(d) = c.
2. E_c μ̄(c) = α_c ≫ 1/K.
3. E_c μ̄(d) ≈ (1 − α_c)/(K − 1) for d ≠ c.

The validity of the first two is clear from Table 3. The last assumption says that the amount of mass in the mean aggregate distribution that is off the true class tends to be uniformly distributed over the other classes.

Let S_K denote the K-dimensional simplex (probability vectors in R^K), and let U_c = {μ : arg max_d μ(d) = c}, an open convex subset of S_K. Define φ_c to be the (Euclidean) distance from E_c μ̄ to ∂U_c, the boundary of U_c. Clearly ‖μ − E_c μ̄‖ < φ_c implies that arg max_d μ(d) = c, where ‖·‖ denotes Euclidean norm. This is used below to bound the misclassification rate. First, however,
Table 3: Estimates of α_c, γ_c, and e_c for Ten Classes.

Class    0      1      2      3      4      5      6      7      8      9
α_c      0.66   0.86   0.80   0.74   0.74   0.64   0.56   0.86   0.49   0.68
γ_c      0.03   0.01   0.01   0.01   0.03   0.02   0.04   0.01   0.02   0.01
e_c      0.14   0.04   0.03   0.04   0.11   0.13   0.32   0.02   0.23   0.05
we need to compute φ_c. Clearly, ∂U_c = ∪_{d≠c} {μ ∈ S_K : μ(c) = μ(d)}. From symmetry arguments, a point in ∂U_c that achieves the minimum distance to E_c μ̄ will lie in each of the sets in the union above. A straightforward computation involving orthogonal projections then yields

    φ_c = (α_c K − 1) / (√2 (K − 1)).

Using Chebyshev's inequality, a crude upper bound on the misclassification rate for class c is obtained as follows:

    P(Ŷ_S ≠ c|Y = c) = P(x : arg max_d μ̄(x, d) ≠ c|Y = c)
                     ≤ P(‖μ̄ − E_c μ̄‖ > φ_c|Y = c)
                     ≤ (1/φ_c^2) E‖μ̄ − E_c μ̄‖^2
                     = (1/(φ_c^2 N^2)) Σ_{d=1}^{K} [ Σ_{n=1}^{N} Var(μ_{T_n}(d)|Y = c) + Σ_{n≠m} Cov(μ_{T_n}(d), μ_{T_m}(d)|Y = c) ].

Let η_c denote the sum of the conditional variances, and let γ_c denote the sum of the conditional covariances, both averaged over the trees:

    η_c = (1/N) Σ_{n=1}^{N} Σ_{d=1}^{K} Var(μ_{T_n}(d)|Y = c),
    γ_c = (1/N^2) Σ_{n≠m} Σ_{d=1}^{K} Cov(μ_{T_n}(d), μ_{T_m}(d)|Y = c).

We see that

    P(Ŷ_S ≠ c|Y = c) ≤ (γ_c + η_c/N)/φ_c^2 = 2(γ_c + η_c/N)(K − 1)^2 / (α_c K − 1)^2.
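The final bound is easy to evaluate; a sketch (using rounded inputs from Table 3; exact agreement with the tabulated e_c would require the unrounded estimates and the η_c/N term):

    def misclassification_bound(alpha_c, gamma_c, K, eta_c=0.0, N=1):
        # 2 * (gamma_c + eta_c / N) * (K - 1)^2 / (alpha_c * K - 1)^2,
        # the Chebyshev bound on P(Y_hat_S != c | Y = c) derived above.
        return 2 * (gamma_c + eta_c / N) * (K - 1) ** 2 / (alpha_c * K - 1) ** 2

    print(misclassification_bound(0.66, 0.03, K=10))  # ~0.155, vs. e_0 = 0.14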
Since η_c/N will be small compared with γ_c, the key parameters are α_c and γ_c. This inequality yields only coarse bounds. However, it is clear that under the assumptions above, high classification rates are feasible as long as γ_c is sufficiently small and α_c is sufficiently large, even if the estimates μ_{T_n} are poor.

Observe that the N trees form a simple random sample from some large population T of trees under a suitable distribution on T. This is due to the randomization aspect of tree construction. (Recall that at each node, the splitting rule is chosen from a small random sample of queries.) Both E_c μ̄ and the sum of variances are sample means of functionals on T. The sum of the covariances has the form of a U-statistic. Since the trees are drawn independently and the range of the corresponding variables is very small (typically less than 1), standard statistical arguments imply that these sample means are close to the corresponding population means for a moderate number N of trees, say, tens or hundreds. In other words, α_c ≈ E_T E_X(μ_T(c)|Y = c) and γ_c ≈ E_{T×T} Σ_{d=1}^{K} Cov_X(μ_{T_1}(d), μ_{T_2}(d)|Y = c). Thus the conditions on α_c and γ_c translate into conditions on the corresponding expectations over T, and the performance variability among the trees can be ignored.

Table 3 shows some estimates of α_c and γ_c and the resulting bound e_c on the misclassification rate P(Ŷ_S ≠ c|Y = c). Ten pairs of random trees were made on ten classes to estimate γ_c and α_c. Again, the bounds are crude; they could be refined by considering higher-order joint moments of the trees.

8 Generalization

For convenience, we will consider two types of generalization, referred to as interpolation and extrapolation. Our use of these terms may not be standard and is decidedly ad hoc. Interpolation is the easier case; both the training and testing samples are randomly drawn from (X, P), and the number of training samples is sufficiently large to cover the space X. Consequently, for most test points, the classifier is being asked to interpolate among nearby training points.

By extrapolation we mean situations in which the training samples do not represent the space from which the test samples are drawn—for example, training on a very small number of samples per symbol (e.g., one); using different perturbation models to generate the training and test sets, perhaps adding more severe scaling or skewing; or degrading the test images with correlated noise or lowering the resolution. Another example of this occurred at the first NIST competition (Wilkinson et al., 1992); the hand-printed digits in the test set were written by a different population from those in the distributed training set. (Not surprisingly, the distinguishing feature of the winning algorithm was the size and diversity of the actual samples used to train the classifier.) One way to characterize such situations is to regard P as a mixture distribution P = Σ_i α_i P_i, where the P_i might correspond to writer populations, perturbation models, or levels of degradation, for instance.
Table 4: Classification Rates for Various Training Sample Sizes Compared with Nearest-Neighbor Methods.

Sample Size    Trees    NN(B)    NN(raw)
1              44%      11%       5%
8              87       57       31
32             96       74       55
In complex visual recognition problems, the number of terms might be very large, but the training samples might be drawn from relatively few of the P_i and hence represent a biased sample from P.

In order to gauge the difficulty of the problem, we shall consider the performance of two other classifiers, based on k-nearest-neighbor classification with k = 5, which was more or less optimal in our setting. (Using nearest neighbors as a benchmark is common; see, for example, Geman et al., 1992; Khotanzad & Lu, 1991.) Let NN(raw) refer to nearest-neighbor classification based on Hamming distance in (binary) image space, that is, between bitmaps. This is clearly the wrong metric, but it helps to calibrate the difficulty of the problem. Of course, this metric is entirely blind to invariance but is not entirely unreasonable when the symbols nearly fill the bounding box and the degree of perturbation is limited. Let NN(B) refer to nearest-neighbor classification based on the binary tag arrangements. Thus, two images x and x′ are compared by evaluating Q(x) and Q(x′) for all Q ∈ B_0 ⊂ B and computing the Hamming distance between the corresponding binary sequences. B_0 was chosen as the subset of binary tag arrangements that split X to within 5 percent of fifty-fifty. There were 1510 such queries out of the 15,376 binary tag arrangements. Due to invariance and other properties, we would expect this metric to work better than Hamming distance in image space, and of course it does (see below).
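A minimal sketch of the NN(B) baseline (our own names; each training item pairs a precomputed binary query string for the arrangements in B_0 with its class label):

    from collections import Counter

    def nn_b_classify(train, x_queries, k=5):
        # k-nearest-neighbor vote under Hamming distance between
        # binary query strings (0/1 tuples), as in NN(B).
        def hamming(a, b):
            return sum(u != v for u, v in zip(a, b))
        nearest = sorted(train, key=lambda item: hamming(item[0], x_queries))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]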
8.1 Interpolation

One hundred (randomized) trees were constructed from a training data set with thirty-two samples for each of the K = 293 symbols. The average classification rate per tree on a test set V consisting of 100 samples per symbol was 27 percent. However, the performance of the classifier Ŷ_S based on 100 trees was 96 percent. This clearly demonstrates the weak dependence among randomized trees (as well as the discriminating power of the queries). With the NN(B) classifier, the classification rate was 74 percent; with NN(raw), the rate was 55 percent (see Table 4). All of these rates are on the test set.

When the only random perturbations are nonlinear (i.e., no scaling, rotation, or skew), there is not much standardization that can be done to the raw image (see Figure 10).
Figure 10: LATEX symbols perturbed with only nonlinear deformations.
With thirty-two samples per symbol, NN(raw) climbs to 76 percent, whereas the trees reach 98.5 percent.

8.2 Extrapolation

We also grew trees using only the original prototypes x*_c, c = 1, . . . , 293, recursively dividing this group until pure leaves were obtained. Of course, the trees are relatively shallow. In this case, only about half the symbols in X could then be recognized (see Table 4).

The 100 trees grown with thirty-two samples per symbol were tested on samples that exhibit a greater level of distortion or variability than described up to this point. The results appear in Table 5. "Upscaling" (resp. "downscaling") refers to uniform sampling between the original scale and twice (resp. half) the original scale, as in the top (resp. middle) panel of Figure 11; "spot noise" refers to adding correlated noise (see the top panel of Figure 8). Clutter (see the bottom panel of Figure 11) refers to the addition of pieces of other symbols in the image. All of these distortions came in addition to the random nonlinear deformations, skew, and rotations. Downscaling creates more confusions due to extreme thinning of the stroke.

Notice that the NN(B) classifier falls apart with spot noise. The reason is the number of false positives: tags due to the noise induce random occurrences of simple arrangements. In contrast, complex arrangements A are far less likely to be found in the image by pure chance; therefore, chance occurrences are weeded out deeper in the tree.

8.3 Note

The purpose of all the experiments in this article is to illustrate various attributes of the recognition strategy. No effort was made to optimize the classification rates. In particular, the same tags and tree-making protocol were used in every experiment.
Table 5: Classification Rates for Various Perturbations.

Type of Perturbation    Trees    NN(B)    NN(raw)
Original                96%      74%      55%
Upscaling               88       57        0
Downscaling             80       52        0
Spot noise              71       28       57
Clutter                 74       27       59
Figure 11: (Top) Upscaling. (Middle) Downscaling. (Bottom) Clutter.
Experiments were repeated several times; the variability was negligible.

One direction that appears promising is explicitly introducing different protocols from tree to tree in order to decrease the dependence. One small experiment was carried out in this direction. All the images were subsampled to half the resolution; for example, 32 × 32 images become 16 × 16. A tag tree was made with 4 × 4 subimages from the subsampled data set, and one hundred trees were grown using the subsampled training set.
The output of these trees was combined with the output of the original trees on the test data. No change in the classification rate was observed for the original test set. For the test set with spot noise, the two sets of trees each had a classification rate of about 72 percent; combined, however, they yielded a rate of 86 percent. Clearly there is significant potential for improvement in this direction.
9 Incremental Learning and Universal Trees

The parameters μ_{n,τ}(c) = P(Y = c|T_n = τ) can be incrementally updated with new training samples. Given a set of trees, the actual counts from the training set (instead of the normalized distributions) are kept in the terminal nodes τ. When a new labeled sample is obtained, it can be dropped down each of the trees and the corresponding counters incremented. There is no need to keep the image itself. This separation between tree construction and parameter estimation is crucial. It provides a mechanism for gradually learning to recognize an increasing number of shapes.

Trees originally constructed with training samples from a small number of classes can eventually be updated to accommodate new classes; the parameters can be reestimated. In addition, as more data points are observed, the estimates of the terminal distributions can be perpetually refined. Finally, the trees can be further deepened as more data become available. Each terminal node is assigned a randomly chosen list of minimal extensions of the pending arrangement. The answers to these queries are then calculated and stored for each new labeled sample that reaches that node; again there is no need to keep the sample itself. When sufficiently many samples are accumulated, the best query on the list is determined by a simple calculation based on the stored information, and the node can then be split.

The adaptivity to additional classes is illustrated in the following experiment. A set of one hundred trees was grown with training samples from 50 classes randomly chosen from the full set of 293 classes. The trees were grown to depth 10 just as before (see section 8). Using the original training set of thirty-two samples per class for all 293 classes, the terminal distributions were estimated and recorded for each tree. The aggregate classification rate on all 293 classes was about 90 percent, as compared with about 96 percent when the full training set is used for both quantization and parameter estimation. Clearly fifty shapes are sufficient to produce a reasonably sharp quantization of the entire shape space.

As for improving the parameter estimates, recall that the one hundred trees grown with the pure symbols reached 44 percent on the test set. The terminal distributions of these trees were then updated using the original training set of thirty-two samples per symbol. The classification rate on the same test set climbed from 44 percent to 90 percent.
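The update itself is trivial; a minimal sketch (our own names; each tree is assumed to expose a hypothetical leaf(x) method returning the terminal node reached by x):

    def update_counts(trees, counts, x, y):
        # Incremental learning: drop the new labeled sample (x, y)
        # down every tree and increment the class counter at the
        # terminal node it reaches.  counts[n] maps (leaf, class)
        # -> count; normalizing the counts at a leaf recovers the
        # terminal distribution mu_{n,tau}(c).
        for n, t in enumerate(trees):
            key = (t.leaf(x), y)
            counts[n][key] = counts[n].get(key, 0) + 1

The image x can then be discarded; only the counters persist.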
10 Fast Indexing

One problem with recognition paradigms such as "hypothesize and test" is determining which particular hypothesis to test. Indexing into the shape library is therefore a central issue, especially with methods based on matching image data to model data and involving large numbers of shape classes. The standard approach in model-based vision is to flag plausible interpretations by searching for key features or discriminating parts in hierarchical representations.

Indexing efficiency seems to be inversely related to stability with respect to image degradation. Deformable templates are highly robust because they provide a global interpretation for many of the image data. However, a good deal of searching may be necessary to find the right template. The method of invariant features lies at the other extreme of this axis: the indexing is one shot, but there is not much tolerance to distortions of the data. We have not attempted to formulate this trade-off in a manner susceptible to experimentation.

We have noticed, however, that multiple trees appear to offer a reliable mechanism for fast indexing, at least within the framework of this article and in terms of narrowing down the number of possible classes. For example, in the original experiment with a 96 percent classification rate, the five highest-ranking classes in the aggregate distribution μ̄ contained the true class in all but four images in a test set of size 1465 (five samples per class). Even with upscaling, for example, the true label was among the top five in 98 percent of the cases. These experiments suggest that very high recognition rates could be obtained with final tests dedicated to ambiguous cases, as determined, for example, by the mode of μ̄.
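Such a candidate list is immediate from the aggregate distribution; a minimal sketch (reusing the μ̄ dictionary returned by the aggregation sketch in section 6.1):

    def index_candidates(mu_bar, top=5):
        # Fast indexing: the `top` highest-ranking classes under the
        # aggregate distribution, to be passed to dedicated final tests.
        return sorted(mu_bar, key=mu_bar.get, reverse=True)[:top]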
11 Handwritten Digit Recognition

The optical character recognition (OCR) problem has many variations, and the literature is immense; one survey is Mori, Suen, and Yamamoto (1992). In the area of handwritten character recognition, perhaps the most difficult problem is the recognition of unconstrained script; zip codes and hand-drawn checks also present a formidable challenge. The problem we consider is off-line recognition of isolated binary digits. Even this special case has attracted enormous attention, including a competition sponsored by the NIST (Wilkinson et al., 1992), and there is still no solution that matches human performance, or even one that is commercially viable except in restricted situations. (For comparisons among methods, see Bottou et al., 1994, and the lucid discussion in Brown et al., 1993.) The best reported rates seem to be those obtained by AT&T Bell Laboratories: up to 99.3 percent by training and testing on composites of the NIST training and test sets (Bottou et al., 1994).

We present a brief summary of the results of applying the tree-based shape quantization method to the NIST database. (For a more detailed description, see Geman et al., 1996.)
Figure 12: Random sample of test images before (top) and after (bottom) preprocessing.
Our experiments were based on portions of the NIST database, which consists of approximately 223,000 binary images of isolated digits written by more than 2000 writers. The images vary widely in dimensions, ranging from about twenty to one hundred rows, and they also vary in stroke thickness and other attributes. We used 100,000 for training and 50,000 for testing. A random sample from the test set is shown in Figure 12.

All results reported in the literature utilize rather sophisticated methods of preprocessing, such as thinning, slant correction, and size normalization. For the sake of comparison, we did several experiments using a crude form of slant correction and scaling, and no thinning. Twenty-five trees were made. We stopped splitting when the number of data points in the second largest class fell below ten. The depth of the terminal nodes (i.e., the number of questions asked per tree) varied widely, the average over trees being 8.8. The average number of terminal nodes was about 600, and the average classification rate (determined by taking the mode of the terminal distribution) was about 91 percent. The best error rate we achieved with a single tree was about 7 percent.

The classifier was tested in two ways. First, we preprocessed (scaled and slant corrected) the test set in the same manner as the training set. The resulting classification rate is 99.2 percent (with no rejection).
Figure 13: Classification rate versus number of trees.
Figure 13 shows how the classification rate grows with the number of trees. Recall from section 6.1 that the estimated class of an image x is the mode of the aggregate distribution μ̄(x). A good measure of the confidence in this estimate is the value of μ̄(x) at the mode; call it M(x). It provides a natural mechanism for rejection, by classifying only those images x for which M(x) > m; no rejection corresponds to m = 0. For example, the classification rate is 99.5 percent with 1 percent rejection and 99.8 percent with 3 percent rejection. Finally, doubling the number of trees makes the classification rates 99.3 percent, 99.6 percent, and 99.8 percent at 0, 1, and 2 percent rejection, respectively.
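The rejection rule is a one-line test on the aggregate distribution; a minimal sketch (again reusing the μ̄ dictionary from section 6.1):

    def classify_with_rejection(mu_bar, m=0.0):
        # Accept the mode of mu_bar only if its mass M(x) exceeds
        # the threshold m; return None (reject) otherwise.
        # m = 0 means no rejection.
        c = max(mu_bar, key=mu_bar.get)
        return c if mu_bar[c] > m else None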
We performed a second experiment in which the test data were not preprocessed in the manner of the training data; in fact, the test images were classified without utilizing the size of the bounding box. This is especially important in the presence of noise and clutter, when it is essentially impossible to determine the size of the bounding box. Instead, each test image was classified with the same set of trees at two resolutions (original and halved) and three (fixed) slants. The highest of the resulting six modes determines the classification. The classification rate was 98.9 percent.

We classify approximately fifteen digits per second on a single-processor SUN Sparcstation 20 (without special efforts to optimize the code); the time is approximately equally divided between transforming to tags and answering questions. Test data can be dropped down the trees in parallel, in which case classification would become approximately twenty-five times faster.

12 Comparison with ANNs

The comparison with ANNs is natural in view of their widespread use in pattern recognition (Werbos, 1991) and several common attributes. In particular, neither approach is model based in the sense of utilizing explicit shape models. In addition, both are computationally very efficient in comparison with model-based methods as well as with memory-based methods (e.g., nearest neighbors). Finally, in both cases performance is diminished by overdedication to the training data (overfitting), and problems result from deficient approximation capacity or parameter estimation, or both. Two key differences are described in the following two subsections.

12.1 The Posterior Distribution and Generalization Error

It is not clear how a feature vector such as Q could be accommodated in the ANN framework. Direct use is not likely to be successful since ANNs based on very high-dimensional input suffer from poor generalization (Baum & Haussler, 1989), and very large training sets are then necessary in order to approximate complex decision boundaries (Raudys & Jain, 1991).

The role of the posterior distribution P(Y = c|Q) is more explicit in our approach, leading to a somewhat different explanation of the sources of error. In our case, the approximation error results from replacing the entire feature set by an adaptive subset (or really many subsets, one per tree). The difference between the full posterior and the tree-based approximations can be thought of as the analog of approximation error in the analysis of the learning capacity of ANNs (see, e.g., Niyogi & Girosi, 1996). In that case one is interested in the set of functions that can be generated by the family of networks of a fixed architecture. Some work has centered on approximation of the particular function Q → P(Y = c|Q). For example, Lee et al. (1991) consider least-squares approximation to the Bayes rule under conditional independence assumptions on the features; the error vanishes as the number of hidden units goes to infinity.

Estimation error in our case results from an inadequate amount of data to estimate the conditional distributions at the leaves of the trees. The analogous situation for ANNs is well known: even for a network of unlimited capacity, there is still the problem of estimating the weights from limited data (Niyogi & Girosi, 1996).
capacity, there is still the problem of estimating the weights from limited data (Niyogi & Girosi, 1996).

Finally, classifiers based on ANNs address classification in a less nonparametric fashion. In contrast to our approach, the classifier has an explicit functional (input-output) representation from the feature vector to $\hat{Y}$. Of course, a great many parameters may need to be estimated in order to achieve low approximation error, at least for very complex classification problems such as shape recognition. This leads to a relatively large variance component in the bias/variance decomposition. In view of these difficulties, the small error rates achieved by ANNs on handwritten digits and other shape recognition problems are noteworthy.

12.2 Invariance and the Visual System. The structure of ANNs for shape classification is partially motivated by observations about processing in the visual cortex. Thus, the emphasis has been on parallel processing and the local integration (“funneling”) of information from one layer to the next. Since this local integration is done in a spatially stationary manner, translation invariance is achieved. The most successful example in the context of handwritten digits is the convolutional neural net LeNet (LeCun et al., 1990). Scale, deformation, and other forms of invariance are achieved to a lesser degree, partially from using gray-level data and soft thresholds, but mainly by utilizing very large training sets and sophisticated image normalization procedures.

The neocognitron (Fukushima & Miyake, 1982; Fukushima & Wake, 1991) is an interesting mechanism to achieve a degree of robustness to shape deformations and scale changes in addition to translation invariance. This is done using hidden layers, which carry out a local ORing operation mimicking the operation of the complex neurons in V1. These layers are referred to as C-type layers in Fukushima and Miyake (1982). A dual-resolution version (Gochin, 1994) is aimed at achieving scale invariance over a larger range. Spatial stationarity of the connection weights, the C-layers, and the use of multiple resolutions are all forms of ORing aimed at achieving invariance. The complex cell in V1 is indeed the most basic evidence of such an operation in the visual system. There is additional evidence for disjunction at a very global scale in the cells of the inferotemporal cortex with very large receptive fields. These respond to certain shapes at all locations in the receptive field, at a variety of scales, but also for a variety of nonlinear deformations such as changes in proportions of certain parts (see Ito, Fujita, Tamura, & Tanaka, 1994; Ito, Tamura, Fujita, & Tanaka, 1995).

It would be extremely inefficient to obtain such invariance through ORing of responses to each of the variations of these shapes. It would be preferable to carry out the global ORing at the level of primitive generic features, which can serve to identify and discriminate between a large class of shapes. We have demonstrated that global features consisting of geometric
arrangements of local responses (tags) can do just that. The detection of any of the tag arrangements described in the preceding sections can be viewed as an extensive and global ORing operation. The ideal arrangement, such as a vertical edge northwest of a horizontal edge, is tested at all locations, all scales, and a large range of angles around the northwest vector. A positive response to any of these produces a positive answer to the associated query. This leads to a property we have called semi-invariance and allows our method to perform well without preprocessing and without very large training sets. Good performance extends to transformations or perturbations not encountered during training (e.g., scale changes, spot noise, and clutter).

The local responses we employ are very similar to ones employed at the first level of the convolutional neural nets, for example, in Fukushima and Wake (1991). However, at the next level, our geometric arrangements are defined explicitly and designed to accommodate a priori assumptions about shape regularity and variability. In contrast, the features in convolutional neural nets are all implicitly derived from local inputs to each layer and from the slack obtained by local ORing, or soft thresholding, carried out layer by layer. It is not clear that this is sufficient to obtain the required level of invariance, nor is it clear how portable the features are from one problem to another.

Evidence for correlated activities of neurons with distant receptive fields is accumulating, not only in V1 but in the lateral geniculate nucleus and in the retina (Neuenschwander & Singer, 1996). Gilbert, Das, Ito, Kapadia, and Westheimer (1996) report increasing evidence for more extensive horizontal connections. Large optical point spreads are observed, where subthreshold neural activation appears in an area covering the size of several receptive fields. When an orientation-sensitive cell fires in response to a stimulus in its receptive field, subthreshold activation is observed in cells with the same orientation preference in a wide area outside the receptive field. It appears therefore that the integration mechanisms in V1 are more global and definitely more complex than previously thought. Perhaps the features described in this article can contribute to bridging the gap between existing computational models and new experimental findings regarding the visual system.

13 Conclusion

We have studied a new approach to shape classification and illustrated its performance in high dimensions, with respect to both the number of shape classes and the degree of variability within classes. The basic premise is that shapes can be distinguished from one another by a sufficiently robust and invariant form of recursive partitioning. This quantization of shape space is based on growing binary classification trees using geometric arrangements among local topographic codes as splitting rules. The arrangements are semi-invariant to linear and nonlinear image transformations. As a result,
the method generalizes well to samples not encountered during training. In addition, due to the separation between quantization and estimation, the framework accommodates unsupervised and incremental learning. The codes are primitive and redundant, and the arrangements involve only simple spatial relationships based on relative angles and distances. It is not necessary to isolate distinguished points along shape boundaries or any other special differential or topological structures, or to perform any form of grouping, matching, or functional optimization. Consequently, a virtually infinite collection of discriminating shape features is generated with elementary computations at the pixel level. Since no classifier based on the full feature set is realizable, and since it is impossible to know a priori which features are informative, we have selected features and built trees at the same time by inductive learning.

Another data-driven nonparametric method for shape classification is based on ANNs, and a comparison was drawn in terms of invariance and generalization error, leading to the conclusion that prior information plays a relatively greater role in our approach.

We have experimented with the NIST database of handwritten digits and with a synthetic database constructed from linear and nonlinear deformations of about three hundred LaTeX symbols. Despite a large degree of within-class variability, the setting is evidently simplified, since the images are binary and the shapes are isolated. Looking ahead, the central question is whether our recognition strategy can be extended to more unconstrained scenarios, involving, for example, multiple or solid objects, general poses, and a variety of image formation models. We are aware that our approach differs from the standard one in computer vision, which emphasizes viewing and object models, 3D matching, and advanced geometry. Nonetheless, we are currently modifying the strategy in this article in order to recognize 3D objects against structured backgrounds (e.g., cluttered desktops) based on gray-level images acquired from ordinary video cameras (see Jedynak & Fleuret, 1996). We are also looking at applications to face detection and recognition.

Acknowledgments

We thank Ken Wilder for many valuable suggestions and for carrying out the experiments on handwritten digit recognition. Yali Amit has been supported in part by the ARO under grant DAAL-03-92-0322. Donald Geman has been supported in part by the NSF under grant DMS-9217655, ONR under contract N00014-91-J-1021, and ARPA under contract MDA972-93-10012.

References

Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Comp., 1, 151–160.
Binford, T. O., & Levitt, T. S. (1993). Quasi-invariants: Theory and exploitation. In Proceedings of the Image Understanding Workshop (pp. 819–828). Washington, D.C.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Müller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwritten digit recognition. In Proc. 12th Inter. Conf. on Pattern Recognition (Vol. 2, pp. 77–82). Los Alamitos, CA: IEEE Computer Society Press.
Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 451). Berkeley: Department of Statistics, University of California, Berkeley.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Brown, D., Corruble, V., & Pittard, C. L. (1993). A comparison of decision tree classifiers with backpropagation neural networks for multimodal classification problems. Pattern Recognition, 26, 953–961.
Burns, J. B., Weiss, R. S., & Riseman, E. M. (1993). View variation of point set and line segment features. IEEE Trans. PAMI, 15, 51–68.
Casey, R. G., & Jih, C. R. (1983). A processor-based OCR system. IBM Journal of Research and Development, 27, 386–399.
Casey, R. G., & Nagy, G. (1984). Decision tree design using a probabilistic model. IEEE Trans. Information Theory, 30, 93–99.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. J. Artificial Intell. Res., 2, 263–286.
Forsyth, D., Mundy, J. L., Zisserman, A., Coelho, C., Heller, A., & Rothwell, C. (1991). Invariant descriptors for 3-D object recognition and pose. IEEE Trans. PAMI, 13, 971–991.
Friedman, J. H. (1977). A recursive partitioning decision rule for nonparametric classification. IEEE Trans. Comput., 26, 404–408.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 455–469.
Fukushima, K., & Wake, N. (1991). Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. Neural Networks, 2, 355–365.
Gelfand, S. B., & Delp, E. J. (1991). On tree structured classifiers. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition (pp. 51–70). Amsterdam: North-Holland.
Geman, D., Amit, Y., & Wilder, K. (1996). Joint induction of shape features and tree classifiers (Tech. Rep.). Amherst: University of Massachusetts.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gilbert, C. D., Das, A., Ito, M., Kapadia, M., & Westheimer, G. (1996). Spatial integration and cortical dynamics. Proc. Natl. Acad. Sci., 93, 615–622.
Gochin, P. M. (1994). Properties of simulated neurons from a model of primate inferior temporal cortex. Cerebral Cortex, 5, 532–543.
Guo, H., & Gelfand, S. B. (1992). Classification trees with neural network feature
extraction. IEEE Trans. Neural Networks, 3, 923–933.
Hastie, T., Buja, A., & Tibshirani, R. (1995). Penalized discriminant analysis. Annals of Statistics, 23, 73–103.
Hussain, B., & Kabuka, M. R. (1994). A novel feature recognition neural network and its application to character recognition. IEEE Trans. PAMI, 16, 99–106.
Ito, M., Fujita, I., Tamura, H., & Tanaka, K. (1994). Processing of contrast polarity of visual images of inferotemporal cortex of Macaque monkey. Cerebral Cortex, 5, 499–508.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal response in monkey inferotemporal cortex. J. Neurophysiology, 73(1), 218–226.
Jedynak, B., & Fleuret, F. (1996). Reconnaissance d'objets 3D à l'aide d'arbres de classification. In Proc. Image Com 96. Bordeaux, France.
Jung, D.-M., & Nagy, G. (1995). Joint feature and classifier design for OCR. In Proc. 3rd Inter. Conf. Document Analysis and Processing (Vol. 2, pp. 1115–1118). Montreal. Los Alamitos, CA: IEEE Computer Society Press.
Khotanzad, A., & Lu, J.-H. (1991). Shape and texture recognition by a neural network. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Knerr, S., Personnaz, L., & Dreyfus, G. (1992). Handwritten digit recognition by neural networks with single-layer training. IEEE Trans. Neural Networks, 3, 962–968.
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Proc. of the 12th International Conference on Machine Learning (pp. 313–321). San Mateo, CA: Morgan Kaufmann.
Lamdan, Y., Schwartz, J. T., & Wolfson, H. J. (1988). Object recognition by affine invariant matching. In IEEE Int. Conf. Computer Vision and Pattern Rec. (pp. 335–344). Los Alamitos, CA: IEEE Computer Society Press.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2). San Mateo, CA: Morgan Kaufmann.
Lee, D.-S., Srihari, S. N., & Gaborski, R. (1991). Bayesian and neural network pattern recognition: A theoretical connection and empirical results with handwritten characters. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Martin, G. L., & Pitman, J. A. (1991). Recognizing hand-printed letters and digits using backpropagation learning. Neural Computation, 3, 258–267.
Mori, S., Suen, C. Y., & Yamamoto, K. (1992). Historical review of OCR research and development. In Proc. IEEE (Vol. 80, pp. 1029–1057). New York: IEEE.
Mundy, J. L., & Zisserman, A. (1992). Geometric invariance in computer vision. Cambridge, MA: MIT Press.
Neuenschwander, S., & Singer, W. (1996). Long-range synchronization of oscillatory light responses in cat retina and lateral geniculate nucleus. Nature, 379(22), 728–733.
Niyogi, P., & Girosi, F. (1996). On the relationship between generalization error, hypothesis complexity, and sample complexity for radial basis functions. Neural Comp., 8, 819–842.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Raudys, S., & Jain, A. K. (1991). Small sample size problems in designing artificial neural networks. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Reiss, T. H. (1993). Recognizing planar objects using invariant image features. Lecture Notes in Computer Science No. 676. Berlin: Springer-Verlag.
Sabourin, M., & Mitiche, A. (1992). Optical character recognition by a neural network. Neural Networks, 5, 843–852.
Sethi, I. K. (1991). Decision tree performance enhancement using an artificial neural network implementation. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Shlien, S. (1990). Multiple binary decision tree classifiers. Pattern Recognition, 23, 757–763.
Simard, P. Y., LeCun, Y. L., & Denker, J. S. (1994). Memory-based character recognition using a transformation invariant metric. In Proc. 12th Inter. Conf. on Pattern Recognition (Vol. 2, pp. 262–267). Los Alamitos, CA: IEEE Computer Society Press.
Werbos, P. (1991). Links between artificial neural networks (ANN) and statistical pattern recognition. In I. K. Sethi & A. K. Jain (Eds.), Artificial neural networks and statistical pattern recognition. Amsterdam: North-Holland.
Wilkinson, R. A., Geist, J., Janet, S., Grother, P., Gurges, C., Creecy, R., Hammond, B., Hull, J., Larsen, N., Vogl, T., & Wilson, C. (1992). The first census optical character recognition system conference (Tech. Rep. No. NISTIR 4912). Gaithersburg, MD: National Institute of Standards and Technology.
Received April 18, 1996; accepted November 1, 1996.
Communicated by Eric Mjolsness
Airline Crew Scheduling with Potts Neurons

Martin Lagerholm, Carsten Peterson, and Bo Söderberg
Department of Theoretical Physics, University of Lund, Sölvegatan 14A, S-223 62 Lund, Sweden
A Potts feedback neural network approach for finding good solutions to resource allocation problems with a nonfixed topology is presented. As a target application, the airline crew scheduling problem is chosen. The topological complication is handled by means of a propagator defined in terms of Potts neurons. The approach is tested on artificial random problems tuned to resemble real-world conditions. Very good results are obtained for a variety of problem sizes. The computer time demand for the approach only grows like (number of flights)³. A realistic problem typically is solved within minutes, partly due to a prior reduction of the problem size, based on an analysis of the local arrival and departure structure at the single airports.

1 Introduction

In the past decade, feedback neural networks have emerged as a useful method to obtain good approximate solutions to various resource allocation problems (Hopfield & Tank, 1985; Peterson & Söderberg, 1989; Gislén, Söderberg, & Peterson, 1989, 1992). Most applications have concerned fairly academic problems like the traveling salesman problem (TSP) and various graph partition problems (Hopfield & Tank, 1985; Peterson & Söderberg, 1989). Gislén, Söderberg, and Peterson (1989, 1992) explored high school scheduling. The typical approach proceeds in two steps: (1) map the problem onto a neural network (spin) system with a problem-specific energy function, and (2) minimize the energy by means of deterministic mean field (MF) equations, which allow for a probabilistic interpretation. Two basic mapping variants are common: a hybrid (template) approach (Durbin & Willshaw, 1987) and a “purely” neural one. The template approach is advantageous for low-dimensional geometrical problems like the TSP, whereas for generic resource allocation problems, a purely neural Potts encoding is preferable.

A challenging resource allocation problem is airline crew scheduling, where a given flight schedule is to be covered by a set of crew rotations, each consisting of a connected sequence of flights (legs), starting and ending at a

Neural Computation 9, 1589–1599 (1997)
© 1997 Massachusetts Institute of Technology
given home base (hub). The total crew waiting time is then to be minimized, subject to a number of restrictions on the rotations. This application differs strongly from high school scheduling (Gislén, Söderberg, & Peterson, 1989, 1992) in the existence of nontrivial topological restrictions. A similar structure occurs in multitask telephone routing.

A common approach to this problem is to convert it into a set covering problem, by generating a large pool of legal rotation templates and seeking a subset of the templates that precisely covers the entire flight schedule. Solutions to the set covering problem are then found with linear programming techniques or feedback neural network methods (Ohlsson, Peterson, & Söderberg, 1993). A disadvantage with this method is that the rotation generation, for computational reasons, has to be nonexhaustive for a large problem; thus, only a fraction of the solution space is available.

The approach to the crew scheduling problem developed in this article is quite different and proceeds in two steps. First, the full solution space is narrowed down using a reduction technique that removes a large part of the suboptimal solutions. Then, a mean field annealing approach based on Potts neurons is applied; a novel key ingredient is the use of a propagator formalism for handling topology, leg counting, and so forth. The method, which is explored on random artificial problems resembling real-world situations, performs well with respect to quality, with a computational requirement that grows like $N_f^3$, where $N_f$ is the number of flights.

2 Nature of the Problem

Typically, a real-world flight schedule has a basic period of one week. Given such a schedule in terms of a set of $N_f$ weekly flights, with specified times and airports of departure and arrival, a crew is to be assigned to each flight such that the total crew waiting time is minimized, subject to the constraints:

• Each crew must follow a connected path (a rotation) starting and ending at the hub (see Figure 1).
• The number of flight legs in a rotation must not exceed a given upper bound $L_{max}$.
• The total duration (flight plus waiting time) of a rotation is similarly bounded by $T_{max}$.

Figure 1: Schematic view of the three crew rotations starting and ending in a hub.

These are the crucial and difficult constraints. In a real-world problem, there are often some twenty additional ones, which we neglect for simplicity; they constitute no additional challenge from an algorithmic point of view. Without the above constraints, the problem would reduce to that of minimizing waiting times independently at each airport (the local problems). These can be solved exactly in polynomial time, for example, by pairwise
connecting arrival flights with subsequent departure flights. It is the global structural requirements that make the crew scheduling problem a challenge.

3 Reduction of Problem Size: Airport Fragmentation

Prior to developing our artificial neural network method, we will describe a technique to reduce the size of the problem, based on the local flight structure at each airport. With the waiting time between an arriving flight i and a departing flight j defined as

$$ t^{(w)}_{ij} = \left( t^{(dep)}_j - t^{(arr)}_i \right) \bmod \text{period}, \tag{3.1} $$
the total waiting time for a given problem can change only by an integer times the period. By demanding a minimal waiting time, the local problem at an airport typically can be split up into independent subproblems, where each contains a subset of the arrivals and an equally large subset of the departures. For example, with A/D denoting arrival/departure, the time ordering [AADDAD] can, if minimal waiting time is demanded, be divided into two subproblems: [AADD] and [AD]. Some of these are trivial ([AD]), forcing the crew of an arrival to continue to a particular departure. The minimal total wait time for the local problems is straightforward to compute and will be denoted by $T^{wait}_{min}$.

Similarly, by demanding a solution (assuming one exists) with $T^{wait}_{min}$ also for the constrained global problem, this can be reduced as follows (a minimal code sketch of this reduction appears below):

• Airport fragmentation. Divide each airport into effective airports corresponding to the nontrivial local subproblems.
• Flight clustering. Join every forced sequence of flights into one effective
composite flight, which will thus represent more than one leg and have a formal duration defined as the sum of the durations of its legs and the waiting times between them.

The reduced problem thus obtained differs from the original problem only in an essential reduction of the suboptimal part of the solution space; the part with minimal waiting time is unaffected by the reduction. The resulting information gain, taken as the natural logarithm of the decrease in the number of possible configurations, empirically seems to scale approximately like 1.5 times (number of flights), and ranges from 100 to 2000 for the problems probed.

The reduced problem may in most cases be further separated into a set of independent subproblems, which can be solved one by one. Some of the composite flights will formally arrive at the same effective airport they started from. This does not pose a problem. Indeed, if the airport in question is the hub, such a single flight constitutes a separate (trivial) subproblem, representing an entire forced rotation. Typically, one of the subproblems will be much larger than the rest and will be referred to as the kernel problem; the remaining subproblems will be essentially trivial. In the formalism below, we allow for the possibility that the problem to be solved has been reduced as described above, which means that flights may be composite.

4 Potts Encoding

A naive way to encode the crew scheduling problem would be to introduce Potts spins in analogy with what was done by Gislén, Söderberg, and Peterson (1989, 1992), where each event (lecture) is mapped onto a resource unit (lecture room, time slot). This would require a Potts spin for each flight to handle the mapping onto crews. Since the problem consists of linking together sequences of (composite) flights such that proper rotations are formed, starting and ending at the hub, it appears more natural to choose an encoding where each flight i is mapped, via a Potts spin, onto the flight j to follow it in the rotation:

$$ s_{ij} = \begin{cases} 1 & \text{if flight } i \text{ precedes flight } j \text{ in a rotation} \\ 0 & \text{otherwise} \end{cases} $$

where it is understood that j is restricted to depart from the (effective) airport where i arrives. In order to ensure that proper rotations are formed, each flight has to be mapped onto precisely one other flight. This restriction is inherent in the Potts spin, defined to have precisely one component “on”:

$$ \sum_j s_{ij} = 1. \tag{4.1} $$
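To make the reduction step of section 3 concrete, here is a minimal Python sketch (ours, not from the paper; the name split_airport is hypothetical). It splits an airport's time-ordered arrival/departure string into independent local subproblems by cutting wherever the numbers of arrivals and departures seen so far balance, on the assumption that demanding minimal waiting time forbids crew connections across such balance points:

    def split_airport(events):
        # events: time-ordered string of 'A' (arrival) and 'D' (departure)
        # at one airport within the weekly period.
        blocks, start, balance = [], 0, 0
        for i, e in enumerate(events):
            balance += 1 if e == 'A' else -1
            if balance == 0:  # no crew connection can cross this point
                blocks.append(events[start:i + 1])
                start = i + 1
        return blocks

    # split_airport("AADDAD") -> ["AADD", "AD"]; the trivial "AD" block is a
    # forced connection, which flight clustering joins into a composite flight.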
Figure 2: Expansion of the propagator $P_{ij}$ in terms of $s_{ij}$. A line represents a flight, and (•) a landing and take-off.

In order to start and terminate rotations, two auxiliary flights are intro-
duced at the hub: a dummy arrival a, forced to be mapped onto every proper hub departure, and a dummy departure b, to which every proper hub arrival must be mapped. No other mappings are allowed at the hub. Thus, a acts as a source and b as a sink of rotations; every rotation starts with a and terminates with b. Both have zero duration and leg count, as well as associated wait times.

Global topological properties, leg counts, and durations of rotations cannot be described in a simple way by polynomial functions of the spins. Instead, they are conveniently handled by means of a propagator matrix, P, derived from the Potts spin matrix s (including a, b components) as

$$ P_{ij} = \left[ (1 - s)^{-1} \right]_{ij} = \delta_{ij} + s_{ij} + \sum_{k} s_{ik} s_{kj} + \sum_{kl} s_{ik} s_{kl} s_{lj} + \sum_{klm} s_{ik} s_{kl} s_{lm} s_{mj} + \cdots \tag{4.2} $$
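Numerically, the propagator is just a matrix inverse, and the path-counting interpretation of equation 4.2 is easy to check. A minimal sketch (ours; assumes NumPy) on a toy three-flight chain:

    import numpy as np

    def propagator(s):
        # P = (1 - s)^{-1}; P[i, j] counts connecting paths from flight i to j.
        # The inverse exists (and the series in equation 4.2 converges) only
        # while the mapping s contains no closed loops.
        return np.linalg.inv(np.eye(len(s)) - s)

    s = np.zeros((3, 3))
    s[0, 1] = s[1, 2] = 1.0      # toy chain: flight 0 -> 1 -> 2
    P = propagator(s)
    assert abs(P[0, 2] - 1.0) < 1e-12            # exactly one path from 0 to 2
    assert abs(np.linalg.det(P) - 1.0) < 1e-12   # loop-free rotations: det P = 1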
A pictorial expansion of the propagator is shown in Figure 2. The interpretation is obvious: $P_{ij}$ counts the number of connecting paths from flight i to j. Similarly, an element of the matrix square of P,

$$ \sum_k P_{ik} P_{kj} = \delta_{ij} + 2 s_{ij} + 3 \sum_k s_{ik} s_{kj} + \cdots, \tag{4.3} $$

counts the total number of (composite) legs in the connecting paths, while the number of proper legs is given by

$$ \sum_k P_{ik} L_k P_{kj} = L_i \delta_{ij} + s_{ij} \left( L_i + L_j \right) + \sum_k s_{ik} s_{kj} \left( L_i + L_k + L_j \right) + \cdots, \tag{4.4} $$

where $L_k$ is the intrinsic number of single legs in the composite flight k. Thus, $L_{ij} \equiv \sum_k P_{ik} L_k P_{kj} / P_{ij}$ gives the average leg count of the connecting paths. Similarly, the average duration (flight plus waiting time) of the paths from i to j amounts to

$$ T_{ij} \equiv \frac{\sum_k P_{ik}\, t^{(f)}_k P_{kj} + \sum_{kl} P_{ik}\, t^{(w)}_{kl} s_{kl} P_{lj}}{P_{ij}}, \tag{4.5} $$
where $t^{(f)}_i$ denotes the duration of the composite flight i, including the embedded waiting time. Furthermore, any closed loop (such as obtained if two
flights are mapped onto each other) will make P singular; for a proper set of rotations, det P = 1.

The problem can now be formulated as follows. Minimize $T_{ab}$ subject to the constraints:¹

$$ L_{ib} \le L_{max} \tag{4.6} $$
$$ T_{ib} \le T_{max} \tag{4.7} $$
for every departure i at the hub.

5 Mean Field Approach

We use a mean field (MF) annealing approach in the search for the global minimum. The discrete Potts variables, $s_{ij}$, are replaced by continuous MF Potts neurons, $v_{ij}$. They represent thermal averages $\langle s_{ij} \rangle_T$, with T an artificial temperature to be slowly decreased (annealed), and have an obvious interpretation as probabilities (for flight i to be followed by j). The corresponding probabilistic propagator P will be defined as the matrix inverse of $1 - v$. The neurons are updated by iterating the MF equations

$$ v_{ij} = \frac{\exp(u_{ij}/T)}{\sum_k \exp(u_{ik}/T)} \tag{5.1} $$
for one flight i at a time, by first zeroing the ith row of v,² and then computing the relevant local fields $u_{ij}$ entering equation 5.1 as

$$ u_{ij} = -c_1 t^{(w)}_{ij} - c_2 P_{ji} - c_3 \sum_k v_{kj} - c_4 \Psi\!\left( T^{(ij)}_{rot} - T_{max} \right) - c_5 \Psi\!\left( L^{(ij)}_{rot} - L_{max} \right), \tag{5.2} $$

where j is restricted to be a possible continuation flight to i. In the first term, $t^{(w)}_{ij}$ is the waiting time between flight i and j. The second term suppresses closed loops, and the third term is a soft exclusion term, penalizing solutions where two flights point to the same next flight. In the fourth and fifth terms, $L^{(ij)}_{rot}$ stands for the total leg count and $T^{(ij)}_{rot}$ for the duration of the rotation if i were to be mapped onto j, and they amount to

$$ L^{(ij)}_{rot} = L_{ai} + L_{jb} \tag{5.3} $$
$$ T^{(ij)}_{rot} = T_{ai} + t^{(w)}_{ij} + T_{jb}. \tag{5.4} $$

¹ Minimizing this minimizes the total wait time, since the total flight time is fixed.
² To eliminate self-couplings.
The penalty function Ψ, used to enforce the inequality constraints (Ohlsson et al., 1993; Tank & Hopfield, 1986), is defined by $\Psi(x) = x\,\Theta(x)$, where Θ is the Heaviside step function. Normally, the local fields $u_{ij}$ are derived from a suitable energy function; however, for reasons of simplicity, some of the terms in equation 5.2 are chosen in a more pragmatic way.

After an initial computation of the propagator P from scratch, it is subsequently updated according to the Sherman-Morrison algorithm for incremental matrix inversion (Press, Flannery, Teukolsky, & Vetterling, 1986). An update of the ith row of v, $v_{ij} \to v_{ij} + \delta_j$, generates precisely the following change in the propagator P:

$$ P_{kl} \to P_{kl} + \frac{P_{ki}\, z_l}{1 - z_i}, \tag{5.5} $$
$$ \text{where } z_l = \sum_j \delta_j P_{jl}. \tag{5.6} $$
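A minimal sketch of one serial neuron update (ours, not the authors' code; assumes NumPy, and that the local fields of equation 5.2 have already been assembled in u_i). Replacing row i of v by the softmax of equation 5.1 is a single rank-one change of 1 − v, so the propagator can be patched with equations 5.5 and 5.6 rather than re-inverted:

    import numpy as np

    def mf_row_update(v, P, u_i, i, T):
        # One mean field step for flight i; u_i holds the local fields u_ij
        # of equation 5.2, T is the annealing temperature.
        new = np.exp((u_i - u_i.max()) / T)   # shift by max for numerical stability
        new /= new.sum()                      # softmax of equation 5.1
        delta = new - v[i]                    # net change of row i (zeroing + refill)
        z = delta @ P                         # z_l = sum_j delta_j P_jl (eq. 5.6)
        P += np.outer(P[:, i], z) / (1.0 - z[i])   # eq. 5.5; requires z_i != 1
        v[i] = new

Because only one row of v changes per step, this rank-one patch is exact, not an approximation.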
Inverting the matrix from scratch would take $O(N^3)$ operations, while the (exact) scheme devised above requires only $O(N^2)$ per row. As the temperature goes to zero, a solution crystallizes in a winner-takes-all dynamics: for each flight i, the largest $u_{ij}$ determines the continuation flight j to be chosen.

6 Test Problems

In choosing test problems, our aim has been to maintain a reasonable degree of realism, while avoiding unnecessary complication and at the same time not limiting ourselves to a few real-world problems, where one can always tune parameters and procedures to get good performance. In order to accomplish this, we have analyzed two typical real-world template problems obtained from a major airline: one consisting of long-distance (LD) and the other of short- and medium-distance (SMD) flights. As can be seen in Figure 3, LD flight time distributions are centered around long times, with a small hump for shorter times representing local continuations of long flights. The SMD flight times have a more compact distribution. For each template we have made a distinct problem generator producing random problems resembling the template.

Figure 3: Flight time distributions in minutes for (a) LD and (b) SMD template problems.

A problem with a specified number of airports and flights is generated as follows. First, the distances (flight times) between airports are chosen randomly from a suitable distribution. Then a flight schedule is built up in the form of legal rotations starting and ending at the hub. For every new leg, the waiting time and the next airport are randomly chosen in a way designed to make the resulting problems statistically resemble the respective template problem.

Due to the excessive time consumption of the available exact methods (e.g., branch and bound), the performance of the Potts approach cannot be
tested against these—except for problems that are, in this context, ridiculously small, for which the Potts solution quality matches that of an exact algorithm. For artificial problems of a more realistic size, we circumvent this obstacle. Since problems are generated by producing a legal set of rotations, we add in the generator a final check that the implied solution yields $T^{wait}_{min}$; if not, a new problem is generated. Theoretically, this might introduce a bias in the problem ensemble; empirically, however, no problems have had to be redone. Also, the two real problems turn out to be solvable at $T^{wait}_{min}$.

Each problem is then reduced as described above (using a negligible amount of computer time), and the kernel problem is stored as a list of flights, with all traces of the generating rotations removed.

7 Results

We have tested the performance of the Potts MF approach for both LD and SMD kernel problems of varying sizes. As an annealing schedule for the serial updating of the MF equations (5.1 and 5.5), we have used $T_n = k T_{n-1}$ with k = 0.9. In principle, a proper value for the initial temperature $T_0$ can be estimated from linearizing the dynamics of the MF equations. We have chosen a more pragmatic approach: the initial temperature is assigned a tentative value of 1.0, which is dynamically adjusted based on the measured rate of change of the neurons until a proper $T_0$ is found. The values used for the coefficients $c_i$ in equation 5.2 are chosen in an equally simple and pragmatic way: $c_1 = 1/\text{period}$, $c_2 = c_3 = 1$, while $c_4 = 1/\langle T_{rot} \rangle$ and $c_5 = 1/\langle L_{rot} \rangle$, where $\langle T_{rot} \rangle$ is the average duration (based on $T^{wait}_{min}$) and $\langle L_{rot} \rangle$ the
Table 1: Average Performance of the Potts Algorithm for LD Problems.

 N_f   N_a   ⟨N_f^eff⟩   ⟨N_a^eff⟩   ⟨R⟩   ⟨CPU time⟩
  75    5        23          8       0.0    0.0 sec
 100    5        50         17       0.0    0.2 sec
 150   10        55         17       0.0    0.3 sec
 200   10        99         29       0.0    1.3 sec
 225   15        84         26       0.0    0.7 sec
 300   15       154         46       0.0    3.4 sec

Notes: The superscript “eff” refers to the kernel problem; subscripts “f” and “a” refer to flight and airport, respectively. The averages are taken over ten different problems for each $N_f$. The performance is measured as the difference between the waiting time in the Potts and the local solutions, divided by the period. The CPU time refers to a DEC Alpha 2000.
average leg count per rotation, both of which can be computed beforehand. It is worth stressing that these parameter settings have been used for the entire range of problem sizes probed.

When evaluating a solution obtained with the Potts approach, a check is done as to whether it is legal (if not, a simple postprocessor restores legality; this is only occasionally needed); then the solution quality is probed by measuring the excess waiting time R,

$$ R = \frac{T^{wait}_{Potts} - T^{wait}_{min}}{\text{period}}, \tag{7.1} $$
which is a nonnegative integer for a legal solution. For a given problem size, as given by the desired number of airports $N_a$ and flights $N_f$, a set of ten distinct problems is generated. Each problem is subsequently reduced, and the Potts algorithm is applied to the resulting kernel problem. The solutions are evaluated, and the average R for the set is computed. The results for a set of problem sizes ranging from $N_f \simeq 75$ to 1000 are shown in Tables 1 and 2; for the two real problems, see Table 3.

The results are quite impressive. The Potts algorithm has solved all problems, and with a very modest CPU time consumption, of which the major part goes into updating the P matrix. The sweep time scales like $(N_f^{eff})^3 \propto N_f^3$, with a small prefactor due to the fast method used (see equations 5.5 and 5.6). This should be multiplied by the number of sweeps needed—empirically between 30 and 40, independent of problem size.³

³ The minor apparent deviation from the expected scaling in Tables 1, 2, and 3 is due to an anomalous scaling of the Digital DXML library routines employed; the number of elementary operations does scale like $N_f^3$.
Table 2: Average Performance of the Potts Algorithm for SMD Problems.

 N_f   N_a   ⟨N_f^eff⟩   ⟨N_a^eff⟩   ⟨R⟩   ⟨CPU time⟩
  600   40      280          64      0.0     19 sec
  675   45      327          72      0.0     35 sec
  700   35      370          83      0.0     56 sec
  750   50      414          87      0.0     90 sec
  800   40      441          91      0.0    164 sec
  900   45      535         101      0.0    390 sec
 1000   50      614         109      0.0    656 sec

Note: The averages are taken over ten different problems for each $N_f$. Same notation as in Table 1.
Table 3: Average Performance of the Potts Algorithm for Ten Runs on the Two Real Problems.

 N_f   N_a   ⟨N_f^eff⟩   ⟨N_a^eff⟩   ⟨R⟩   ⟨CPU time⟩   Type
 189   15        71          24      0.0    0.6 sec      LD
 948   64       383          98      0.0    184 sec      SMD

Note: Same notation as in Table 1.
8 Summary

We have developed a mean field Potts approach for solving resource allocation problems with a nontrivial topology. The method is applied to airline crew scheduling problems resembling real-world situations. A novel key feature is the handling of global entities, sensitive to the dynamically changing “fuzzy” topology, by means of a propagator formalism. Another important ingredient is the problem size reduction achieved by airport fragmentation and flight clustering, narrowing down the solution space by removing much of the suboptimal part.

High-quality solutions are consistently found throughout a range of problem sizes without having to fine-tune the parameters, with a time consumption scaling as the cube of the problem size. The basic approach should be easy to adapt to other applications (e.g., communication routing).
References

Durbin, R., & Willshaw, D. (1987). An analog approach to the traveling salesman problem using an elastic net method. Nature, 326, 689.
Gislén, L., Söderberg, B., & Peterson, C. (1989). Teachers and classes with neural networks. International Journal of Neural Systems, 1, 167.
Gislén, L., Söderberg, B., & Peterson, C. (1992). Complex scheduling with Potts neural networks. Neural Computation, 4, 805.
Hopfield, J. J., & Tank, D. W. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141.
Ohlsson, M., Peterson, C., & Söderberg, B. (1993). Neural networks for optimization problems with inequality constraints—the knapsack problem. Neural Computation, 5, 331.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Tank, D. W., & Hopfield, J. J. (1986). Simple neural optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, CAS-33, 533.
Received May 20, 1996; accepted November 27, 1996.
Communicated by John Platt
Online Learning in Radial Basis Function Networks

Jason A. S. Freeman
Centre for Cognitive Science, University of Edinburgh, Edinburgh EH8 9LW, U.K.

David Saad
Department of Computer Science and Applied Mathematics, University of Aston, Birmingham B4 7ET, U.K.
An analytic investigation of the average case learning and generalization properties of radial basis function (RBF) networks is presented, utilizing online gradient descent as the learning rule. The analytic method employed allows both the calculation of generalization error and the examination of the internal dynamics of the network. The generalization error and internal dynamics are then used to examine the role of the learning rate and the specialization of the hidden units, which gives insight into decreasing the time required for training. The realizable and some overrealizable cases are studied in detail: the phase of learning in which the hidden units are unspecialized (symmetric phase) and the phase in which asymptotic convergence occurs are analyzed, and their typical properties found. Finally, simulations are performed that strongly confirm the analytic results.

1 Introduction

Several tools facilitate the analytic investigation of learning and generalization in supervised neural networks, such as the statistical physics methods (see Watkin, Rau, & Biehl, 1993, for a review), the Bayesian framework (MacKay, 1992), and the “probably approximately correct” (PAC) method (Haussler, 1994). These tools have principally been applied to simple networks, such as linear and boolean perceptrons, and various simplifications of the committee machine (see, for instance, Schwarze, 1993, and references therein). It has proved very difficult to obtain general results for the commonly used multilayer networks, such as the sigmoid multilayer perceptron (MLP) and the radial basis function (RBF) network.

Another approach, based on studying the dynamics of online gradient descent training scenarios, has been used by several authors (Heskes & Kappen, 1991; Leen & Orr, 1994; Amari, 1993) to examine the evolution of system parameters, primarily in the asymptotic regime. A similar approach, based on examining the dynamics of overlaps between characteristic sys-

Neural Computation 9, 1601–1622 (1997)
© 1997 Massachusetts Institute of Technology
tem vectors in online training scenarios, has been suggested recently (Saad & Solla, 1995a, 1995b) for investigating the learning dynamics in the soft committee machine (SCM) (Biehl & Schwarze, 1995). This approach provides a complete description of the learning process, formulated in terms of the overlaps between vectors in the system, and it can be easily extended to include general two-layer networks (Riegler & Biehl, 1995).

For RBFs, some analytic studies focus primarily on generalization error. In Freeman and Saad (1995a, 1995b), average case analyses are performed employing a Bayesian framework to study RBFs under a stochastic training paradigm. In Niyogi and Girosi (1994), a bound on generalization error is derived under the assumption that the training algorithm finds a globally optimal solution. Details of studies of RBFs from the perspective of the PAC framework can be found in Holden and Rayner (1995) and its references. These methods focus on a training scenario in which a model is trained on a fixed set of examples using a stochastic training method.

This article presents a method for analyzing the behavior of RBFs in an online learning scenario whereby network parameters are modified after each presentation of an example, which allows the calculation of generalization error as a function of a set of variables characterizing the properties of the adaptive parameters of the network. The dynamical evolution of these variables in the average case can be found, allowing not only the generalization ability but also the internal dynamics of the network, such as specialization of hidden units, to be analyzed. This tool has previously been applied to MLPs (Saad & Solla, 1995a, 1995b; Riegler & Biehl, 1995).

2 Training Paradigms for RBF Networks

RBF networks have been successfully employed over the years in many real-world tasks, providing a useful alternative to MLPs. Furthermore, the RBF is a universal approximator for continuous functions given a sufficient number of hidden units (Hartman, Keeler, & Kowalski, 1990). The RBF architecture consists of a two-layer fully connected network. The mapping performed by each hidden node represents a radially symmetric basis function; within this analysis, the basis functions are considered gaussian, and each is therefore parameterized by two quantities: a vector representing the position of the basis function center in input space and a scalar representing the width of the basis function. For simplicity, the output layer is taken to consist of a single node; this performs a linear combination of the hidden unit outputs.

There are two commonly utilized methods for training RBFs. One approach is to fix the parameters of the hidden layer (both the basis function centers and widths) using an unsupervised technique such as clustering, setting a center on each data point of the training set, or even picking random values (for a review, see Bishop, 1995). Only the hidden-to-output
weights are adaptable, which makes the problem linear in those weights. Although fast to train, this approach results in suboptimal networks, since the basis function centers are set to fixed, suboptimal values. The alternative is to adapt the hidden layer parameters—either just the center positions or both center positions and widths. This renders the problem nonlinear in the adaptable parameters, and hence requires an optimization technique, such as gradient descent, to estimate these parameters. The second approach is computationally more expensive but usually leads to greater accuracy of approximation. This article investigates the nonlinear approach, in which basis function centers are continuously modified to allow convergence to more optimal models.

There are two methods in use for gradient descent. In batch learning, one attempts to minimize the additive training error over the entire data set; adjustments to parameters are performed once the full training set has been presented. The alternative approach, examined here, is online learning, in which the adaptive parameters of the network are adjusted after each presentation of a new data point.¹ There has been a resurgence of interest analytically in the online method; technical difficulties caused by the variety of ways in which a training set of given size can be selected are avoided, so complicated techniques such as the replica method (Hertz, Krogh, & Palmer, 1989) are unnecessary.

3 Online Learning in RBF Networks

We examine a gradient descent online training scenario on a continuous error measure. The trained model (student) is an RBF network consisting of K basis functions. The center of student basis function (SBF) b is denoted by $m_b$, and the hidden-to-output weights of the student are represented by w. Training examples will consist of input-output pairs (ξ, ζ). The components of ξ are uncorrelated gaussian random variables of mean 0 and variance $\sigma_\xi^2$, while ζ is generated by applying ξ to a deterministic teacher RBF, but one in which the number M and the positions of the hidden units need not correspond to those of the student, which allows investigation of overrealizable and unrealizable cases.² The mapping implemented by the teacher is denoted by $f_T$ and that of the student by $f_S$. The hidden-to-output weights of the teacher are $w^0$, while the center of teacher basis function u is given by $n_u$. The vector of SBF responses to input vector ξ is represented by s(ξ), and those of the teacher are denoted by t(ξ). The overall functions computed by
¹ Obviously one may employ a method that is a compromise between the two extremes.
² This represents a general training scenario since, being universal approximators, RBF networks can approximate any continuous mapping to a desired degree.
the networks are therefore:³

$$ f_S(\xi) = \sum_{b=1}^{K} w_b \exp\left( -\frac{\|\xi - m_b\|^2}{2\sigma_B^2} \right) = w \cdot s(\xi) \tag{3.1} $$

$$ f_T(\xi) = \sum_{u=1}^{M} w^0_u \exp\left( -\frac{\|\xi - n_u\|^2}{2\sigma_B^2} \right) = w^0 \cdot t(\xi) \tag{3.2} $$
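In code, equations 3.1 and 3.2 are one short function. A minimal sketch (ours, not the paper's; the name rbf_forward is hypothetical, NumPy assumed) that also returns the hidden responses, since these are reused by the update rules of section 3.2:

    import numpy as np

    def rbf_forward(xi, centers, weights, sigma_B=1.0):
        # centers: (K, N) array of basis function centers; weights: (K,) vector.
        # Returns the network output w . s(xi) and the hidden responses s(xi).
        s = np.exp(-np.sum((xi - centers) ** 2, axis=1) / (2.0 * sigma_B ** 2))
        return weights @ s, s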
N will denote the dimensionality of input space and P the number of examples presented. The centers of the basis functions (input-to-hidden weights) and the hidden-to-output weights are considered adjustable; for simplicity, the widths of the basis functions are fixed to a common value $\sigma_B$. The evolution of the centers of the basis functions is described in terms of the overlaps $Q_{bc} \equiv m_b \cdot m_c$, $R_{bu} \equiv m_b \cdot n_u$, and $T_{uv} \equiv n_u \cdot n_v$, where $T_{uv}$ is constant and describes characteristics of the task to be learned.

Previous work in this area (Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b; Riegler & Biehl, 1995) has relied on the thermodynamic limit.⁴ This limit allows one to ignore fluctuations in the updates of the means of the overlaps due to the randomness of the training examples, and permits the difference equations of gradient descent to be considered as differential equations. The thermodynamic limit is hugely artificial for local RBFs; as the activation is localized, the N → ∞ limit implies that a basis function responds only in the vanishingly unlikely event that an input point falls exactly on its center; there is no obvious reasonable rescaling of the basis functions.⁵ The price paid for not taking this limit is that one has no a priori justification for ignoring the fluctuations in the update of the adaptive parameters due to the randomness of the training example. In this work, we calculate both the means and variances of the adaptive parameters, showing that the fluctuations are practically negligible (see section 5).

3.1 Calculating the Generalization Error. Generalization error measures the average dissimilarity over input space between the desired mapping $f_T$
µ exp −
kξ − mb k2
¶
2NσB2
eliminates all directional information as the cross-term ξ · mb vanishes in the thermodynamic limit.
and that implemented by the learning model $f_S$. This dissimilarity is taken as quadratic deviation:

$$ E_G = \frac{1}{2} \left\langle \left[ f_S - f_T \right]^2 \right\rangle, \tag{3.3} $$

where ⟨···⟩ denotes an average over input space with respect to the measure p(ξ). Substituting the definitions of equations 3.1 and 3.2 into this leads to:

$$ E_G = \frac{1}{2} \left\{ \sum_{bc} w_b w_c \langle s_b s_c \rangle + \sum_{uv} w^0_u w^0_v \langle t_u t_v \rangle - 2 \sum_{bu} w_b w^0_u \langle s_b t_u \rangle \right\}. \tag{3.4} $$

Since the input distribution is gaussian, the averages are gaussian integrals and can be performed analytically; the resulting expression for generalization error is given in the appendix. Each average has dependence on combinations of Q, R, and T depending on whether the averaged basis functions belong to student or teacher.

3.2 System Dynamics. Expressions for the time evolution of the overlaps Q and R can be derived by employing the gradient descent rule,

$$ m_b^{p+1} = m_b^p + \frac{\eta}{N\sigma_B^2}\, \delta_b^p \left( \xi - m_b^p \right), $$

where $\delta_b = (f_T - f_S) w_b s_b$ and η is the learning rate, which is explicitly scaled with 1/N:

$$ \langle \Delta Q_{bc} \rangle = \frac{\eta}{N\sigma_B^2} \left\langle \delta_b^p \left( \xi - m_b^p \right) \cdot m_c^p + \delta_c^p \left( \xi - m_c^p \right) \cdot m_b^p \right\rangle + \left( \frac{\eta}{N\sigma_B^2} \right)^2 \left\langle \delta_b^p \delta_c^p \left( \xi - m_b^p \right) \cdot \left( \xi - m_c^p \right) \right\rangle \tag{3.5} $$

$$ \langle \Delta R_{bu} \rangle = \frac{\eta}{N\sigma_B^2} \left\langle \delta_b^p \left( \xi - m_b^p \right) \cdot n_u \right\rangle. \tag{3.6} $$
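The paper evaluates the gaussian averages in equations 3.3 and 3.4 analytically (see the appendix); as a numerical sanity check, $E_G$ can also be estimated by Monte Carlo sampling of gaussian inputs. A sketch (ours), reusing the rbf_forward helper above:

    import numpy as np

    def generalization_error_mc(student, teacher, n_samples=10000,
                                N=8, sigma_xi=1.0, seed=0):
        # Estimates E_G = 0.5 <(f_S - f_T)^2> of equation 3.3 by sampling
        # inputs with i.i.d. gaussian components (mean 0, variance sigma_xi^2).
        rng = np.random.default_rng(seed)
        xs = rng.normal(0.0, sigma_xi, size=(n_samples, N))
        sq = [(student(x)[0] - teacher(x)[0]) ** 2 for x in xs]
        return 0.5 * float(np.mean(sq))

    # e.g. student = lambda x: rbf_forward(x, m, w)
    #      teacher = lambda x: rbf_forward(x, n, w0)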
The hidden-to-output weights can be treated similarly, but here the learning rate is scaled with 1/K, yielding:⁶

$$ \langle \Delta w_b \rangle = \frac{\eta}{K} \left\langle (f_T - f_S)\, s_b \right\rangle. \tag{3.7} $$
These averages are again gaussian integrals, so they can be carried out analytically. The averaged expressions for ΔQ, ΔR, and Δw are given in the appendix.

⁶ For simplicity we use the same learning rate for both the centers and the hidden-to-output weights, although different learning rates may be employed.
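For comparison with the iterated mean equations, the online scenario itself is straightforward to simulate. A sketch of a single update step (ours) under the stated scalings (η/(Nσ_B²) for the centers, η/K for the weights), again reusing the rbf_forward helper:

    import numpy as np

    def online_step(m, w, n, w0, eta=0.9, sigma_B=1.0, sigma_xi=1.0, rng=None):
        # m, w: student centers (K, N) and hidden-to-output weights (K,).
        # n, w0: teacher centers (M, N) and weights (M,).
        rng = rng or np.random.default_rng()
        N = m.shape[1]
        xi = rng.normal(0.0, sigma_xi, size=N)       # fresh example each step
        fS, s = rbf_forward(xi, m, w, sigma_B)
        fT, _ = rbf_forward(xi, n, w0, sigma_B)
        delta = (fT - fS) * w * s                    # delta_b = (f_T - f_S) w_b s_b
        m += (eta / (N * sigma_B ** 2)) * delta[:, None] * (xi - m)
        w += (eta / len(w)) * (fT - fS) * s          # learning rate scaled by 1/K
        return m, w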
By iterating equations 3.5, 3.6, and 3.7, the evolution of the learning process can be tracked. This allows one to examine facets of learning such as specialization of the hidden units. Since generalization error depends on Q, R, and w, one can also use these equations with equation 3.4 to track the evolution of generalization error.

4 Analysis of Learning Scenarios

4.1 The Evolution of the Learning Process. Solving the difference equations 3.5, 3.6, and 3.7 iteratively, one obtains solutions to the mean behavior of the overlaps and the weights. There are four distinct phases in the learning process, which are described with reference to an example of learning an exactly realizable task. The task consists of three SBFs learning a graded teacher of three teacher basis functions (TBFs), where graded implies that the square norms of the TBFs (diagonals of T) differ from one another; for this task, $T_{00} = 0.5$, $T_{11} = 1.0$, and $T_{22} = 1.5$. In this demonstration the teacher is chosen to be uncorrelated, so the off-diagonals of T are 0, and the teacher hidden-to-output weights $w^0$ are set to 1. The learning process is illustrated in Figure 1. Figure 1a (solid curve) shows the evolution of generalization error, calculated from equation 3.4, while Figures 1b–d show the evolution of the equations for the means of R, Q, and w, respectively, calculated by iterating equations 3.5, 3.6, and 3.7 from random initial conditions sampled from the following uniform distributions: $Q_{bb}$ and $w_b$ are sampled from U[0, 0.1], while $Q_{bc, b \ne c}$ and $R_{bc}$ from U[0, 10⁻⁶]. These initial conditions will be used throughout the article and reflect random correlations expected by arbitrary initialization of large systems. Input dimensionality N = 8, learning rate η = 0.9, input variance $\sigma_\xi^2 = 1$, and basis function width $\sigma_B^2 = 1$ will be employed unless stated otherwise.

Figure 1: The exactly realizable scenario with positive TBFs. Three SBFs learn a graded, uncorrelated teacher of three TBFs with $T_{00} = 0.5$, $T_{11} = 1.0$, and $T_{22} = 1.5$. All teacher hidden-to-output weights are set to 1. (a) The evolution of the generalization error as a function of the number of examples for several different learning rates (η = 0.1, 0.9, 5.0). (b, c) The evolution of overlaps between student and teacher center vectors and among student center vectors, respectively. (d) The evolution of the mean hidden-to-output weights.

Initially, there is a short transient phase in which the overlaps and hidden-to-output weights evolve from their initial conditions until they reach an approximately steady value (P = 0 to P = 1000). The symmetric phase then begins, which is characterized by a plateau in the evolution of the generalization error (see Figure 1a, solid curve; P = 1000 to P = 7000), corresponding to a lack of differentiation among the hidden units; they are unspecialized and learn an average of the hidden units of the teacher, so that the student center vectors and hidden-to-output weights are similar (see Figures 1b–d). The difference in value between the overlaps R of student center vectors with teacher center vectors (see Figure 1b) is only due to the difference in the lengths of various teacher center vectors; if the overlaps were normalized, they would be identical. The symmetric phase is followed by a symmetry-breaking phase in which the SBFs learn to specialize and become differentiated from one another (P = 7000 to P = 20,000). Finally there is a long convergence phase, as the overlaps and hidden-to-output weights reach their asymptotic values. Since the task is realizable,
this phase is characterized by $E_G \to 0$ (see Figure 1a, solid curve) and by the student center vectors and hidden-to-output weights approaching those of the teacher (i.e., $Q_{00} = R_{00} = 0.5$, $Q_{11} = R_{11} = 1.0$, $Q_{22} = R_{22} = 1.5$, with the off-diagonal elements of both Q and R being zero; ∀b, $w_b = 1$).⁷ These phases are generic in that they are observed, sometimes with some variation such as a series of symmetric and symmetry-breaking phases, in every online learning scenario for RBFs so far examined. They also correspond to the phases found for MLPs (Saad & Solla, 1995b; Riegler & Biehl, 1995).

⁷ The arbitrary labels of the SBFs were permuted to match those of the teacher.
The formalism describes the evolution of the means (and the variances) from certain initial conditions. Convergence of the dynamics to suboptimal attractive fixed points (local minima) may occur if the starting point is within the corresponding basin of attraction. No local minima have been observed in our solutions, which may be an artifact of the system dimensionality.

4.2 The Role of the Learning Rate. With all the TBFs positive, analysis of the time evolution of the generalization error, overlaps, and hidden-to-output weights for various settings of the learning rate reveals the existence of three distinct behaviors. If η is chosen to be too small (here, η = 0.1), there is a long period in which there is no specialization of the SBFs and no improvement in generalization ability. The process becomes trapped in a symmetric subspace of solutions; this is the symmetric phase. Given asymmetry in the student initial conditions (in R, Q, or w) or of the task itself, this subspace will always be escaped, but the time period required may be prohibitively large (see Figure 1a, dotted curve). The length of the symmetric phase increases with the symmetry of the initial conditions.

At the other extreme, if η is set too large, an initial transient takes place quickly, but there comes a point from which the student vector norms grow extremely rapidly, until the point where, due to the finite variance of the input distribution and the local nature of the basis functions, the SBFs are no longer activated during training (see Figure 1a, dashed curve, with η = 5.0). In this case, the generalization error approaches a finite value as P → ∞, and the task is not solved.

Between these extremes lies a region in which the symmetric subspace is escaped quickly, and $E_G \to 0$ as P → ∞ for the realizable case (see Figure 1a, solid curve, with η = 0.9). The SBFs become specialized and, asymptotically, the teacher is emulated exactly. These results for the learning rate are qualitatively similar to those found for SCMs and MLPs (Biehl & Schwarze, 1995; Saad & Solla, 1995a, 1995b; Riegler & Biehl, 1995).

4.3 Task Dependence. The symmetric phase depends on the symmetry of the task as well as that of the initial conditions. One would expect a shorter symmetric phase in inherently asymmetric tasks. To examine this, a task similar to that of section 4.1 was employed, with the single change being that the sign of one of the teacher hidden-to-output weights was flipped, thus providing two categories of targets: positive and negative. The initial conditions of the student remained the same as in the previous task, with η = 0.9.

Figure 2: The exactly realizable scenario defined by a teacher network with a mixture of positive and negative TBFs. Three SBFs learn a graded, uncorrelated teacher of three TBFs with $T_{00} = 0.5$, $T_{11} = 1.0$, and $T_{22} = 1.5$. $w^0_0 = 1$, $w^0_1 = -1$, $w^0_2 = 1$. (a) The evolution of the generalization error for this case and, for comparison, the evolution in the case of all positive TBFs. (b) The evolution of the overlaps between student and teacher centers R.

The evolution of generalization error and the overlaps for this task is shown in Figure 2. The dividing of the targets into two categories effectively eliminates the symmetric phase; this can be seen by comparing the evolution of the generalization error for this task (see Figure 2a, dashed curve) with that for the previous task (see Figure 2a, solid curve). There is no longer a plateau in the generalization error. Correspondingly, the symmetries be-
Online Learning
1609
0.0040
1.5
Positive Targets Pos/Neg Targets
0.0030
1.0
Eg
0.5
0.0020
R 0.0
0.0010
R00 R 01 R02
-0.5
0.0000
0
10000
20000
(a)
P
30000
40000
50000
-1.0
0
10000
20000
P
R10 R 11 R 12 30000
R20 R21 R22 40000
50000
(b)
Figure 2: The exactly realizable scenario defined by a teacher network with a mixture of positive and negative TBFs. Three SBFs learn a graded, uncorrelated teacher of three TBFs with T00 = 0.5, T11 = 1.0, and T22 = 1.5. w00 = 1, w01 = −1, w02 = 1. (a) The evolution of the generalization error for this case and, for comparison, the evolution in the case of all positive TBFs. (b) The evolution of the overlaps between student and teacher centers R.
tween SBFs break immediately, as can be seen by examining the overlaps between student and teacher center vectors (see Figure 2b); this should be compared with figure 1b, which denotes the evolution of the overlaps in the previous task. Note that the plateaus in the overlaps (see Figure 1b, P = 1000 to P = 7000) are not found for the antisymmetric task. The elimination of the symmetric phase is an extreme result caused by the extremely asymmetric teacher. For networks with many hidden units, one can find a cascade of subsymmetric phases, each shorter than the single symmetric phase in the corresponding task with only positive targets, in which there is one symmetry between the hidden units seeking positive targets and another between those seeking negative targets. This suggests a simple and easily implemented strategy for increasing the speed of learning when targets are predominantly positive (negative): eliminate the bias of the training set by subtracting (adding) the mean target from each target point. This corresponds to an old heuristic among RBF practitioners. It follows that the hidden-to-output weights should be initialized from a zero-mean distribution. 4.4 The Overrealizable Case. In real-world problems, the exact form of the data-generating mechanism is rarely known. This leads to the possibility that the student may be overly powerful, in that it is capable of fitting surfaces more complicated than that of the true teacher. It is important to gain insight into how architectures will respond given such a scenario
1610
Jason A. S. Freeman and David Saad
0.0020
2.0
Exactly Realizable, Task 1 Over-Realizable, Task 1 Exactly Realizable, Task 2 Over-realizable, Task 2
0.0015
1.5
Eg
R
0.0010
1.0
0.0005
0.5
0.0000
0
10000
20000
P
30000
40000
R01 R10 R22 R40
0.0
50000
0
10000
(a)
20000
P
30000
40000
50000
(b) 1.0
0.5
W 0.0
W0 W1 W2 W3 W4
-0.5
-1.0
0
10000
20000
P
30000
40000
50000
(c)
Figure 3: The overrealizable scenario. (a) The evolution of the generalization error in two tasks; each task is learned by a well-matched student (exactly realizable) and an overly powerful student (overrealizable). (b, c) The evolution of the overlaps R and the hidden-to-output weights w for the overrealizable case in the second task, in which the teacher RBF includes a mixture of positive and negative hidden-to-output weights. In this scenario, five SBFs learn a graded, uncorrelated teacher of three TBFs with T00 = 0.5, T11 = 1.0, and T22 = 1.5. w00 = 1, w01 = −1, w02 = 1.
in order to be confident that they can be used successfully when the true teacher is unknown. Intuitively, one might expect that a student that is well matched to the teacher will learn faster than one that is overly powerful. Figure 3a shows two tasks, each of which compares the overrealizable scenario with the well-matched case. The first task, consisting of three TBFs, is identical to that detailed in section 4.1, and hence has only positive targets. The performance of a well-matched student of three SBFs is compared with an overrealizable scenario in which five SBFs learn the three TBFs. Comparison of the evolution of generalization error between these learning scenarios
Online Learning
1611
is shown in Figure 3a; the solid curve represents the well-matched scenario, and the dot-dash curve illustrates the overrealizable scenario. The length of the symmetric phase is significantly increased with the overly powerful student. The length of the convergence phase is also increased. An analytical treatment of these effects as well as the overrealizable scenario generally is given elsewhere (Freeman & Saad, in press). The second task deals with the alternative scenario in which one TBF has a negative hidden-to-output weight; the task is identical to that defined in section 4.3, and the student initial conditions are again as specified in section 4.1. In Figure 3a the evolution of generalization error for both the overrealizable scenario (dashed curve) in which five SBFs learn three TBFs, and the corresponding well-matched case in which three SBFs learn three TBFs (dotted curve), is shown. There is no well-defined symmetric case due to the inherent asymmetry of the task. The convergence phase is again greatly increased in length; this appears to be a general feature of the overrealizable scenario. Given that the student is overly powerful, there appear to be, a priori, several remedies available to the student: eliminate the excess nodes, form cancellation pairs (in which two students exactly cancel one another), or devise more complicated fitting schemes. To examine the actual responses of the student, the evolution of the overlaps between student and teacher and of the hidden-to-output weights for the particular scenario described by the second trial detailed is presented in Figures 3b and 3c, respectively. Looking first at Figure 3c, it is apparent that w3 approaches zero (short-dashed curve), indicating that SBF 3 is entirely eliminated during training. Thus four SBFs remain to emulate three TBFs. The negative TBF 1 is exactly emulated by SBF 0, as T11 = 1, w01 = −1, and R01 = 1, w0 = −1 (solid curve on both Figures 3b and 3c), while, similarly, SBF 2 exactly emulates TBF 2 (long-dashed curve, both figures). This leaves SBF 1 and SBF 4 to emulate TBF 0. Looking at Figure 3c, dotted and dot-dash curves, both student hidden-to-output weights approach 0.5, exactly half that of the hidden-to-output weight of TBF 0; looking at Figure 3b, both SBFs have 0.5 overlap with TBF 0. This indicates that the sum of both students emulates TBF 0. Thus, elimination and fitting involving the noncancelling combination of nodes were found; in these trials and many others, no pairwise cancellation was found. One presumes that this could be induced by very careful selection of the initial conditions but that it is not found under normal circumstances. 4.5 Analysis of the Symmetric Phase. The symmetric phase, in which there is no specialization of the hidden units, can be analyzed in the realizable case by employing a few simplifying assumptions. It is a phenomenon that is predominantly associated with small η, so terms of η2 are neglected. The hidden-to-output weights are clamped to +1. The teacher is taken to be isotropic: TBF centers have identical norms of 1, each hav-
1612
Jason A. S. Freeman and David Saad
ing no overlap with the others; therefore Tuv = δuv . This has the result that the student norms Qbb are very similar in this phase, as are the studentstudent correlations, so Qbb ≡ Q and Qbc,b6=c ≡ C, where Q becomes the square norm of the SBFs, and C is the overlap between any two different SBFs. Following the geometric argument of Saad & Solla (1995b), in the symmetric phase, the SBF centers are confined to the subspace spanned by the TBF centers. Since Tuv = δuv , the SBF centers can be written in the orthonormal basis defined by the TBF centers, with the components being the overPM Rbu nu . Because the teacher is isotropic, the overlaps are laps R: mb = u=1 independent of both b and u and thus can be written in terms of a single parameter R. Further, this reduction to a single overlap parameter leads to Q = C = MR2 , so the evolution of the overlaps can be described as a single difference equation for R. The analytic solution of equations 3.5, 3.6, and 3.7 under these restrictions is still rather complicated. However, since we are primarily interested in large systems, that is, large K, we examine the dominant terms in the solution. Expanding in 1/K and discarding secondorder terms renders the system simple enough to solve analytically for the symmetric fixed point: R=
³
K 1+
1 σB2
−
σB2 exp
h³
1 2σB2
´
σB2 +1 σB2 +2
i´ .
(4.1)
The stability of the fixed point, and thus the breaking of the symmetric phase, can be examined by an eigenvalue analysis of the dynamics of the system near the fixed point. The method employed is similar to that detailed in Saad and Solla (1995b) and is presented elsewhere (Freeman & Saad, in press). The dominant eigenvalue (λ1 > 0) scales with K and represents a perturbation that breaks the symmetries between the hidden units; the remaining modes λi6=1 < 0, which also scale with K, are irrelevant because they preserve the symmetry. This result is in contrast to that for the SCM (Saad & Solla, 1995b), in which the dominant eigenvalue scales with 1/K. This implies that for RBFs, the more hidden units in the network, the faster the symmetric phase is escaped, resulting in negligible symmetric phases for large systems, while in SCMs the opposite is true. This difference is caused by the contrast between the localized nature of the basis function in the RBF network and the global nature of sigmoidal hidden nodes in SCM. In the SCM case, small perturbations around the symmetric fixed point result in relatively small changes in error since the sigmoidal response changes very slowly as one modifies the weight vectors. On the other hand, the gaussian response decays exponentially as one moves away from the center, so small perturbations around the symmetric fixed point result in massive changes that drive the symmetry breaking. When K increases, the
Online Learning
1613
error surface looks very rugged, emphasizing the peaks and increasing this effect, in contrast to the SCM case, where more sigmoids means a smoother error surface. This does not mean that the symmetric phase can be ignored for realistically sized networks, however. Even with a teacher that is not particularly symmetric, this phase can play a significant role in the learning dynamics. To demonstrate this, a teacher RBF of 10 hidden units with N = 5 was constructed with the teacher centers generated from a gaussian distribution N [0, 0.5]. Note that this teacher must be correlated because the number of centers is larger than the input dimension. A student network, also of 10 hidden units, was constructed with all weights initialized from N [0, 0.05]. The networks were then mapped into the corresponding overlaps, and the learning process was run with η = 0.1. The evolution of generalization error is shown in Figure 4d: the symmetric phase, extending here from P = 2000 to P = 15,000, is a prominent phenomenon of the learning dynamics. It is not merely an artifact of a highly symmetric teacher configuration (the teacher was random and correlated) or of a specially chosen set of initial conditions, as the student was initialized with realistic initial conditions before being mapped into overlaps.
4.6 Analysis of the Convergence Phase. To gain insight into the convergence of the online gradient descent process in a realizable scenario, a similar simplified learning scenario to that used in the symmetric phase analysis was employed. The hidden-to-output weights are again fixed to +1, and the teacher is defined by Tuv = δuv . The scenario can be extended to adaptable hidden-to-output weights (this is presented in Freeman & Saad, in press, along with more mathematical detail). As in the symmetric phase, the fact that Tuv = δuv allows the system to be reduced to four adaptive quantities: Q ≡ Qbb , C ≡ Qbc,b6=c , R ≡ Rbb , and S ≡ Rbc,b6=c . Linearizing this system about the known fixed point of the dynamics, Q = 1, C = 0, R = 1, S = 0, yields an equation of the form 1x = Ax, where x = {1 − R, 1 − Q, S, C} is the vector of deviations from the fixed point. The eigenvalues of the matrix A control the converging system; these are presented in Figure 4a for K = 10. In every case examined, there is a single critical eigenvalue λc that controls the stability and convergence rate of the system (shown in bold), a nonlinear subcritical eigenvalue, and two subcritical linear eigenvalues. The value of η at λc = 0 determines the maximum learning rate for convergence to occur; for λc > 0 the fixed point is unstable. The convergence of the overlaps is controlled by the critical eigenvalue; therefore, the value of η at the single minimum of λc determines the optimal learning rate (ηopt ) in terms of the fastest convergence of the system to the fixed point. Examining ηc and ηopt as a function of K (see Figure 4b), one finds that both quantities scale as 1/K; the maximum and optimal learning rates are
1614
Jason A. S. Freeman and David Saad
4.0
Maximum and Optimal Learning Rates
0.000
Eigenvalues for the Asymptotic Phase
-0.005
3.0
η
λ
ηc
2.0
-0.010
-0.020 0.0
η opt
1.0
-0.015
1.0
2.0
3.0
η
4.0
0.0 0.10
0.08
0.06
0.04
0.02
0.00
1/K
(a)
(b)
20.0
0.030
Maximum Learning Rate versus Basis Function Width
16.0
Generalization Error for a Realistic Learning Scenario
0.025 0.020
Eg
12.0
ηc
σξ2
0.015
8.0 0.010 4.0
0.0 0.0
0.005
1.0
2.0
σB2
(c)
3.0
4.0
5.0
0.000
0
20000
40000
60000
P
(d)
Figure 4: Convergence and symmetric phases. (a) The eigenvalues controlling the dynamics of the system for the convergence phase (detailed in section 4.6), linearized about the asymptotic fixed point in the realizable case, as a function of η. The critical eigenvalue is shown in bold. (b) The maximum and optimal convergence phase learning rates, found from the critical eigenvalue; these quantities scale as 1/K. (c) The maximum convergence phase learning rate as a function of basis function width. (d) The evolution of generalization error for a realistically sized learning scenario (described in section 4.5), demonstrating that the symmetric phase can play a significant role, even with a correlated, asymmetric teacher.
inversely proportional to the number of hidden units of the student. Numerically, the ratio of ηopt to ηc is approximately two-thirds. Finally, the relationship between basis function width and ηc is plotted in Figure 4c. When the widths are small, ηc is very large as it becomes unlikely that a training point will activate any of the basis functions. For σB2 > σξ2 , ηc ∼ 1/σB2 .
Online Learning
1615
5 Quantifying the Variances Because the thermodynamic limit is not employed, it is necessary to quantify the variances in the adaptive parameters to justify considering only the mean updates.8 By making assumptions as to the form of these variances, it is possible to derive equations describing their evolution. Specifically, it is assumed that each update function and parameter being updated can be written in terms of a mean and fluctuation; for instance, applying this to Qbc : q ηd [ (5.1) 1Qbc = 1Qbc + 1Q bc Qbc = Qbc + N Qbc , d where hQbc i denotes an average value and Q bc represents a fluctuation due to the randomness of the example. Combining these equations and averaging with respect to the input distribution results in a set of difference equations describing the evolution of the variances of the overlaps and hidden-to-output weights (similar to Riegler & Biehl, 1995) as training proceeds. Details of the method can be found in Heskes and Kappen (1991) and in Barber, Saad, and Sollich (1996) for the SCM. It has been shown that the variances vanish in the thermodynamic limit for realizable cases (Barber et al., 1996; Heskes & Kappen, 1991). (A detailed description of the calculation of the variances as applied to RBFs appears in Freeman & Saad, in press.) Figure 5 shows the evolution of the variances, as error bars on the mean, for the dominant overlaps and the hidden-to-output weights using η = 0.9, N = 10 on a task identical to that described in section 4.1. Examining the dominant overlaps R first (see Figure 5a), the variances follow the same pattern for each overlap but at different values of P. The variances begin at 0, then increase, peaking at the symmetry-breaking point at which the SBF begins to specialize on a particular TBF; then they decrease to 0 again as convergence occurs. Looking at each SBF in turn, for SBF 2 (dashed curve), the overlap begins to specialize at approximately P = 2000, where the variance peak occurs; for SBF 0 (solid curve), the symmetry lasts until P = 10,000, again where the variance peak occurs, and for SBF 1 (dotted curve), the symmetry breaks later at approximately P = 20,000, again where the peak of the variance occurs. The variances then dwindle to 0 for each SBF in the convergence phase. Essentially the same pattern occurs for the hidden-to-output weights (see Figure 5b). The variances increase rapidly until the hidden units begin 8 The hidden-to-output weights are not intrinsically self-averaging even in the thermodynamic limit, although they have been shown to be such for the MLP if the learning rate is scaled with N (Riegler & Biehl, 1995). If scaled differently, adiabatic elimination techniques may be employed to describe the evolution adequately (Riegler, personal communication, 1996).
1616
Jason A. S. Freeman and David Saad
1.1
2.0
R01 R22 R10
1.5
1.0
0.9
R
W
1.0
0.8
W0 W1
0.5 0.7
0.0
0
10000 20000 30000 40000 50000 60000
P
(a)
0.6
W2 0
10000 20000 30000 40000 50000 60000
P
(b)
Figure 5: Evolution of the variances of the overlaps R and hidden-to-output weights w are shown in (a) and (b), respectively. The curves denote the evolution of the means; the error bars show the evolution of the fluctuations about the mean. Input dimensionality N = 10, learning rate η = 0.9, input variance σξ2 = 1, and basis function width σB2 = 1.0.
to specialize, at which point the variances peak. This is followed by the variances’ decreasing to 0 as convergence occurs. For both overlaps and hidden-to-output weights, the mean is an order of magnitude larger than the standard deviation at the variance peak and is much more dominant elsewhere; the ratio becomes greater as N is increased. The magnitude of the variances is influenced by the degree of symmetry of the initial conditions of the student and the task in that the greater this symmetry is, the larger the variances. Discussion of this phenomenon can be found in Barber et al. (1996); it will be explored at greater length for RBFs in a future publication. 6 Simulations In order to confirm the validity of the analytic results, simulations were performed in which RBFs were trained using online gradient descent. The trajectories of the overlaps were calculated from the trajectories of the weight vectors of the network, and generalization error was estimated by finding the average error on a 1000-point test set. The procedure was performed 50 times and the results averaged, subject to permutation of the labels of the SBFs to ensure the average was meaningful. Typical results are shown in Figure 6. The example shown is for an exactly realizable system of three SBFs and three TBFs at N = 5, η = 0.9. Figure 6a shows the correspondence between empirical test error and theoretical generalization error. At all times, the theoretical result is within
Online Learning
1617
0.0020
1.5
Theoretical Empirical
0.0015
1.0
R
Eg 0.0010
0.5
0.0005
0.0
0.0000
-0.5
Empirical
Theoretical 0
2000
4000
P
6000
8000
10000
0
2000
(a)
P
6000
8000
10000
(b)
1.5
1.2
1.0
1.0
Q
W
0.5
0.8
0.0
0.6
0
2000
4000
(c)
P
Theoretical
Empirical
Theoretical -0.5
4000
6000
8000
10000
0.4
0
2000
4000
P
6000
Empirical 8000
10000
(d)
Figure 6: Comparison of theoretical results with simulations. The simulation results are averaged over 50 trials; the labels of the student hidden units were permuted where necessary to make the averages meaningful. Empirical generalization error was approximated with the test error on a 1000-point test set (a). Error bars on the simulations are at most the size of the larger asterisks for the overlaps (b, c), and at most twice this size for the hidden-to-output weights (d). Input dimensionality N = 5, learning rate η = 0.9, input variance σξ2 = 1, and basis function width σB2 = 1.
one standard deviation of the empirical result. Figures 6b–d show the excellent correspondence between the trajectories of the theoretical overlaps and hidden-to-output weights and their empirical counterparts; the error bars on the simulation distributions are not shown because they are approximately the size of the symbols. The simulations demonstrate the validity of the theoretical results. In addition, we have found excellent correlation between the analytically calculated variances and those obtained from the simulations (this is explored further in Freeman & Saad, in press).
1618
Jason A. S. Freeman and David Saad
7 Conclusion Online learning, in which the adaptive parameters of the network are updated at each presentation of a data point, was examined for the RBF using gradient descent learning. The analytic method presented allows the calculation of the evolution of generalization error and the specialization of the hidden units. This method was used to elucidate the stages of training and the role of the learning rate. There are four stages of training: a short transitory phase in which the adaptive parameters move from the initial conditions to the symmetric phase; the symmetric phase itself, characterized by lack of differentiation among hidden units; a symmetry-breaking phase in which the hidden units become specialized; and a convergence phase in which the adaptive parameters reach their final values asymptotically. Three regimes were found for the learning rate: small, giving unnecessarily slow learning; intermediate, leading to fast escape from the symmetric phase and convergence to the correct target; and too large, which results in a divergence of SBF norms and failure to converge to the correct target. Examining the exactly realizable scenario, it was shown that employing both positive and negative targets leads to much faster symmetry breaking; this appears to be the underlying reason behind the neural network folklore that targets should be given zero mean. The overrealizable case was also studied, showing that overrealizability extends both the length of the symmetric phase and that of the convergence phase. The symmetric phase for realizable scenarios was analyzed and the value of the overlaps at the symmetric fixed point found. It was discovered that there is a significant difference between the behaviors of the RBF and SCM, in that increasing K speeds up the symmetry-breaking in RBFs, while it slows the process for SCMs. The convergence phase was also studied; both maximum and optimal learning rates were calculated and shown to scale as 1/K. The dependence of the maximum learning rate on the width of the basis functions was also examined, and, for σB2 > σξ2 , the maximum learning rate scales approximately as 1/σB2 . Finally, simulations were performed that strongly confirm the theoretical results. Future work includes the study of unrealizable cases, in which the learning rate must decay over time in order to find a stable solution, the study of the effects of noise and regularizers, the extension of the analysis of the convergence phase to fully adaptable hidden-to-output weights, and the use of the theory to aid real-world learning tasks, by, for instance, deliberately breaking the symmetries between SBFs in order to reduce drastically or even eliminate the symmetric phase.
Online Learning
1619
Appendix Generalization Error ( ) X X 1 X 0 0 0 EG = wb wc I2 (b, c) + wu wv I2 (u, v) − 2 wb wu I2 (b, u) (A.1) 2 bc uv bu 1Q, 1R, and 1w
i io h η n h J (b; c) − Q I (b) + w J (c; b) − Q I (c) w 2 c 2 b bc bc 2 2 NσB2 !2 Ã n η wb wc K4 (b, c) + Qbc I4 (b, c) + 2 NσB o (A.2) − J4 (b, c; b) − J4 (b, c; c)
h1Qbc i =
o n η wb J2 (b; u) − Rbu I2 (b) 2 NσB
h1Rbu i =
η I2 (b) K
h1wb i = I, J, and K I2 (b) =
(A.3)
X
(A.4)
w0u I2 (b, u) −
u
w0u J2 (b, u; c) −
u
I4 (b, c) =
wd I2 (b, d)
(A.5)
d
X
J2 (b; c) =
X
X
wd J2 (b, d; c)
(A.6)
d
X
wd we I4 (b, c, d, e) +
de
−2
X
X
w0u w0v I4 (b, c, u, v)
uv
wd w0u I4 (b, c, d, u)
(A.7)
du
J4 (b, c; f ) =
X
wd we J4 (b, c, d, e; f ) +
de
−2
X
X
w0u w0v J4 (b, c, u, v; f )
uv
wd w0u J4 (b, c, d, u; f )
(A.8)
du
K4 (b, c) =
X
wd we K4 (b, c, d, e) +
de
−2
X du
X
w0u w0v K4 (b, c, u, v)
uv
wd w0u K4 (b, c, d, u)
(A.9)
1620
Jason A. S. Freeman and David Saad
I, J, and K. In each case, only the quantity corresponding to averaging over SBFs is presented. Each quantity has similar counterparts in which TBFs are substituted for SBFs. For instance, I2 (b, c) = hsb sc i is presented, and I2 (u, v) = htu tv i and I2 (b, u) = hsb tu i are omitted. I2 (b, c) = (2l2 σξ2 )−N/2 " # −Qbb − Qcc + (Qbb + Qcc + 2Qbc )/2σB2 l2 × exp 2σB2 Ã
!
J2 (b, c; d) =
Qbd + Qcd 2l2 σB2
I4 (b, c, d, e) =
(2l4 σξ2 )−N/2 exp
I2 (b, c) "
"
(A.10)
(A.11)
−Qbb − Qcc − Qdd − Qee 2σB2
#
Qbb +Qcc +Qdd +Qee +2(Qbc +Qbd +Qbe +Qcd +Qce +Qde ) × exp 4l4 σB4
#
(A.12) Ã J4 (b, c, d, e; f ) = Ã K4 (b, c, d, e) =
Qbf + Qcf + Qdf + Qef 2l4 σB2
! I4 (b, c, d, e)
(A.13)
2Nl4 σB4 + Qbb + Qcc + Qdd + Qee 4l4 σB4
! 2(Qbc + Qbd + Qbe + Qcd + Qce + Qde ) + I4 (b, c, d, e) 4l24 σB4 (A.14) Other Quantities l2 = l4 =
2σξ2 + σB2 2σB2 σξ2 4σξ2 + σB2 2σB2 σξ2
(A.15) (A.16)
Acknowledgments We thank Ansgar West and David Barber for useful discussions, and the anonymous referees for their comments. D.S. thanks the Leverhulme Trust for its support (F/250/K).
Online Learning
1621
References Amari, S. (1993). Backpropagation and stochastic gradient descent learning. Neurocomputing, 5, 185–196. Barber, D., Saad, D., & Sollich, P. (1996). Finite size effects in online learning of multilayer neural networks. Euro. Phys. Lett., 34, 151–156. Biehl, M., & Schwarze, H. (1995). Learning by online gradient descent. J. Phys. A: Math. Gen., 28, 643. Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press. Freeman, J., & Saad, D. (1995a). Learning and generalisation in radial basis function networks. Neural Computation, 7, 1000–1020. Freeman, J., & Saad, D. (1995b). Radial basis function networks: Generalization in overrealizable and unrealizable scenarios. Neural Networks, 9, 1521– 1529. Freeman, J., & Saad, D. (in press). The dynamics of on-line learning in radial basis function networks. Phys. Rev. A. Hartman, E., Keeler, J., & Kowalski, J. (1990). Layered neural networks with gaussian hidden units as universal approximators. Neural Computation, 2, 210–215. Haussler, D. (1994). The probably approximately correct (PAC) and other learning models. In A. Meyrowitz & S. Chipman (Eds.), Foundations of knowledge acquisition: Machine learning (Chap. 9). Boston: Kluwer. Hertz, J., Krogh, A., & Palmer, R. (1989). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley. Heskes, T., & Kappen, B. (1991). Learning processes in neural networks. Phys. Rev. A., 44, 2718–2726. Holden, S., & Rayner, P. (1995). Generalization and PAC learning: Some new results for the class of generalized single-layer networks. IEEE Trans. on Neural Networks, 6(2), 368–380. Leen, T., & Orr, G. (1994). Optimal stochastic search and adaptive momentum. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, (6: 477–484). San Mateo, CA: Morgan Kaufmann. MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4, 415–447. Niyogi, P., & Girosi, F. (1994). On the relationship between generalization error, hypothesis complexity and sample complexity for radial basis functions (Tech. Rep.). Cambridge, MA: AI Laboratory, MIT. Riegler, P., & Biehl, M. (1995). On-line backpropagation in two-layered neural networks. J. Phys. A: Math. Gen., 28, L507–513. Saad, D., & Solla, S. (1995a). Exact solution for online learning in multilayer neural networks. Phys. Rev. Lett., 74, 4337–4340. Saad, D., & Solla, S. (1995b). On-line learning in soft committee machines. Phys. Rev. E., 52, 4225–4243.
1622
Jason A. S. Freeman and David Saad
Schwarze, H. (1993). Learning a rule in a multilayer neural network. J. Phys. A: Math. Gen., 26, 5781–5794. Watkin, T., Rau, A., & Biehl, M. (1993). The statistical mechanics of learning a rule. Reviews of Modern Physics, 65, 499–556.
Received March 15, 1996; accepted January 3, 1997.
ARTICLE
Communicated by Pietro Perona
Minimax Entropy Principle and Its Application to Texture Modeling Song Chun Zhu Division of Applied Mathematics, Brown University, Providence, RI 02912, U.S.A.
Ying Nian Wu Department of Statistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.
David Mumford Division of Applied Mathematics, Brown University, Providence, RI 02912, U.S.A.
This article proposes a general theory and methodology, called the minimax entropy principle, for building statistical models for images (or signals) in a variety of applications. This principle consists of two parts. The first is the maximum entropy principle for feature binding (or fusion): for a given set of observed feature statistics, a distribution can be built to bind these feature statistics together by maximizing the entropy over all distributions that reproduce them. The second part is the minimum entropy principle for feature selection: among all plausible sets of feature statistics, we choose the set whose maximum entropy distribution has the minimum entropy. Computational and inferential issues in both parts are addressed; in particular, a feature pursuit procedure is proposed for approximately selecting the optimal set of features. The minimax entropy principle is then corrected by considering the sample variation in the observed feature statistics, and an information criterion for feature pursuit is derived. The minimax entropy principle is applied to texture modeling, where a novel Markov random field (MRF) model, called FRAME (filter, random field, and minimax entropy), is derived, and encouraging results are obtained in experiments on a variety of texture images. The relationship between our theory and the mechanisms of neural computation is also discussed.
1 Introduction This article proposes a general theory and methodology, the minimax entropy principle, for statistical modeling in a variety of applications. This section introduces the basic concepts of the minimax entropy principle after a discussion of the motivation of our theory and a brief review of some relevant theories and methods previously studied in the literature. Neural Computation 9, 1627–1660 (1997)
c 1997 Massachusetts Institute of Technology °
1628
Song Chun Zhu, Ying Nian Wu, and David Mumford
1.1 Motivation and Goal. In a variety of disciplines ranging from computational vision, pattern recognition, and image coding to psychophysics, an important theme is to pursue a probability model to characterize a set of images (or signals) I. This is often posed as a statistical inference problem: we assume that there exists a joint probability distribution (or density) f (I) over the image space; f (I) should concentrate on a subspace that corresponds to the ensemble of images in the application; and the objective is to estimate f (I) given a set of observed (or training) images. f (I) plays significant roles in the following areas: 1. Visual coding, where the goal is to take advantage of the regularity or redundancy in the input images to produce a compact coding scheme. This involves measuring the efficiency of coding schemes in terms of entropy (Watson, 1987; Barlow, Kaushal, & Mitchison, 1989), where the computation of the entropy and thus the choice of the optimal coding schemes depend on the estimation of f (I). For example, two kinds of coding schemes are compared in the recent work of Field (1994): the compact coding and the sparse coding. The former assumes gaussian distributions for f (I), whereas the latter assumes nongaussian ones. 2. Pattern recognition, neural networks, and statistical decision theory, where one often needs to find a probability model f (I) for each category of images of similar patterns. Thus, an accurate estimation of f (I) is a key factor for successful classification and recognition. 3. Computational vision, where f (I) is often adopted as a prior model in terms of Bayesian theory, and it provides a language for visual computation ranging from images segmentation to scene understanding (Zhu, 1996). 4. Texture modeling, where the objective is to estimate f (I) by a probability model p(I) for each set of texture images that have perceptually similar texture appearances. p(I) is important not only for texture analysis such as texture segmentation and classification, but also plays a role in texture synthesis since texture images can be synthesized by sampling p(I). Furthermore, the nature of the texture model helps us understand the mechanisms of human texture perception (Julesz, 1995). However, making inferences about f (I) is much more challenging than many of the learning problems in neural modeling (Dayan, Hinton, Neal, & Zernel, 1995; Xu, 1995) for the following reasons. First, the dimension of the image space is overwhelmingly large compared with the number of available training examples. In texture modeling, for instance, the size of images is often about 200 × 200 pixels, and thus f (I) is a function of 40, 000 variables, whereas we have access to only one or a few training images. This make it inappropriate to use nonparametric inference methods, such
Minimax Entropy Principle
1629
as kernel methods, radial basis functions (Ripley, 1996), and mixture of gaussian models (Jordan & Jacobs, 1994). Second, f (I) is often far from being gaussian; therefore some popular dimension-reduction techniques, such as the principal component analysis (Jolliffe, 1986) and spectral analysis (Priestley, 1981), do not appear to be directly applicable. As an illustration of the nongaussian property, Figure 1a shows the empirical marginal distribution (or histogram) of the intensity differences of horizontally adjacent pixels of some natural images (Zhu & Mumford, 1997). As a comparison, the gaussian distribution with the same mean and variance is plotted as a dashed curve in Figure 1a. Similar nongaussian properties are also observed in Field (1994). Another example is shown in Figure 1b, where the solid curve is the histogram of F ∗ I, with I being a texton image shown in Figure 8a, and F is a filter with the same texton (see section 4.5 for details). It is clear that the solid curve is far from being gaussian, and as a comparison, the dotted curve in Figure 1b is the histogram of F ∗ I, with I being a white noise image. The outliers in the histogram are perceptual features, not noise! 1.2 Previous Methods. A key issue in building a statistical model is the balance between generality and simplicity. The model should include rich structures to describe real-world images adequately and should be capable of modeling complexities due to high dimensionality and nongaussian property, and at the same time, it should be simple enough to be computationally feasible and give simple explanation to what we observe. To reduce complexity, it is often necessary to impose structures on the distribution. In the past, two main methods have been adopted in applications. The first method adopts some parametric Markov random field (MRF) models in the forms of Gibbs distributions—for example, the general smoothness models in image restoration (Geman & Geman, 1984; Mumford & Shah, 1989) and the conditional autoregression models in texture modeling (Besag, 1973; Cross & Jain, 1983). This method involves only a small number of parameters and thus constructs concise distributions for images. However, they do not achieve adequate generality for the following reasons. First, these MRF models can afford only small cliques; otherwise the number of parameters will explode. But these small cliques can hardly capture image features at relatively large scales. Second, the potential functions are of very limited and prespecified forms, whereas in practice it is often desirable for the forms of the distributions to be determined or learned from the observed images. The second method is widely used in visual coding and image reconstruction, where the high-dimensionality problem is avoided by representing the images with a relatively small set of feature statistics, and the latter are usually extracted by a set of well-selected filters. Examples of filters include the frequency and orientation selective Gabor filters (Daugman, 1985) and some wavelet pyramids based on various coding criteria (Mallat, 1989;
1630
Song Chun Zhu, Ying Nian Wu, and David Mumford
(a) 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 −15
−10
−5
0
5
10
15
(b) 0.12
0.1
Histogram H(F*I)
0.08
0.06
0.04
0.02
0
−200
−150
−100 −50 filter response F*I
0
50
Figure 1: (a) The histogram of intensity difference at adjacent pixels and gaussian curve (dashed) of same mean and variance in domain [−15, 15]. (b) Histogram of the filtered texton image (solid curve) and a filtered noise image (dotted curve).
Minimax Entropy Principle
1631
Simoncelli, Freeman, Adelson, & Weeger, 1992; Coifman & Wickerhauser, 1992; Donoho & Johnstone, 1994). The feature statistics extracted by a certain filter is usually the overall histogram of filtered images. These histograms are used for pattern classification, recognition, and visual coding (Watson, 1987; Donoho & Johnstone, 1994). Despite the excellent performances of this method, there are two major problems yet to be solved. The first is the feature binding or feature fusion problem: given a set of filters and their histograms, how to integrate them into a single probability distribution. This problem becomes much more difficult if the filters used are not all linear and are not independent of each other. The second problem is feature selection: for a given model complexity, how to choose a set of filters or features to characterize best the images being modeled. 1.3 Our Theory and Methodology. In this article, a minimax entropy principle is proposed for building statistical models, and it provides a new strategy to balance between model generality and model simplicity by two seemingly contrary criteria: maximizing entropy and minimizing entropy. (I). The Maximum Entropy Principle (Jaynes 1957). Without loss of generality, any image feature can be expressed as φ (α) (I), where φ (α) ( ) can be a vector-valued functions of the image intensities and α is the index of the features. The statistic of the feature φ (α) (I) is E f [φ (α) (I)], which is the expectation of φ (α) (I) with respect to f (I) and is estimated by the sample mean computed from the training images. Then, given a set of features S = {φ (α) , α = 1, 2, . . . , K}, a model p(I) is constructed such that it reproduces the feature statistics as observed; that is, Ep [φ (α) (I)] = E f [φ (α) (I)], for α = 1, 2, . . . , K. Among all model p(I) satisfying such constraints, the maximum entropy principle favors the simplest one in the sense that it has the maximum entropy. Since entropy is a measure of randomness, a maximum entropy (ME) model p(I) is considered as the simplest fusion or binding of the features and their statistics. (II). The Minimum Entropy Principle. The goodness of p(I) constructed in (I) is measured by KL( f, p), that is, the Kullback-Leibler divergence from f (I) to p(I) (Kullback & Leibler, 1951), and it depends on the feature set S that we selected. As we will show in the next section, KL( f, p) is, up to a constant, equal to the entropy of p(I). Thus, to estimate f (I) closely, we need to minimize the entropy of the ME distribution p(I) with respect to S, which often means that we should use as many features as possible to specify p(I). In this sense, a minimum entropy principle favors model generality. When the model complexity or the number of features K is limited, the minimum entropy principle provides a criterion for selecting the feature set S that best characterizes f (I). Computational procedures are proposed for parameter estimation and
1632
Song Chun Zhu, Ying Nian Wu, and David Mumford
feature selection. The minimax entropy principle is further studied in the presence of sample variation of feature statistics. As an example of application, the minimax entropy principle is applied to texture modeling, where the features are extracted by filters that are selected from a general filter bank, and the feature statistics are the empirical marginal distributions (usually further reduced to the histograms) of the filtered images. The resulting model, called FRAME (filters, random fields, and minimax entropy), is a new class of MRF model. Compared with previous MRF models, the FRAME model employs a much more enriched vocabulary and hence enjoys a much stronger descriptive ability, and at the same time, the model complexity is still under check. Texture images are synthesized by sampling the estimated models, and the correctness of estimated models is thus verified by checking whether the synthesized texture images have similar visual appearances to the observed images. The rest of the article is arranged as follows. Section 2 studies the minimax entropy principle, where algorithms are proposed for parameters estimation and feature selection. In Section 3 we study the minimax entropy principle in depth by correcting it in the presence of estimation error and addressing the issue of variance estimation in homogeneous random fields. Section 4 applies the minimax entropy principle to texture modeling. Section 5 concludes with a discussion of the texture model and the relationship between minimax entropy and neural modeling. 2 The Minimax Entropy Principle To fix notation, let I be an image defined on a domain D; for example, D can be an N × N lattice, for each point vE ∈ D, I(E v) ∈ L, and L is the range of image intensities. For a given application, we assume that there exists an underlying probability distribution (or density) f (I) defined on the image space L|D| , where |D| is the size of the image domain. Then the objective is to estimate f (I) based on a set of observed images {Iobs i , i = 1, . . . , M} sampled from f (I). 2.1 The Maximum Entropy Principle. At the initial stage of studying the regularity and variability of the observed images Iobs i , i = 1, 2, . . . , M, one often starts from exploring a finite set of essential features that are characteristic of the observations. Without loss of generality, such features are extracted by S = {φ (α) ( ), α = 1, 2, . . . , K}, where φ (α) (I) can be a vectorvalued function of the image intensities. The statistics of these features are estimated by the sample means,
µ(α) obs =
M 1 X φ (α) (Iobs i ), M i=1
for α = 1, . . . , K.
Minimax Entropy Principle
1633
If the large sample effect takes place (usually a necessary condition for modeling), then the sample averages {µ(α) obs , α = 1, . . . , K} make reasonable estimates for the expectations {E f [φ (α) (I)], α = 1, . . . , K}, where E f denotes the expectation with respect to f (I). We call {µ(α) obs , α = 1, . . . , K} the observed statistics and {E f [φ (α) (I)], α = 1, . . . , K} the expected statistics of f (I). To approximate f (I), a probability model p(I) is restricted to reproduce the observed statistics, that is, Ep [φ (α) (I)] = µ(α) obs for α = 1, . . . , K. Let (α) ÄS = {p(I): Ep [φ (α) (I)] = µ(α) ∈ S} obs , ∀φ
(2.1)
be the set of distributions that reproduce the observed statistics of feature set S; then we need to select a p(I) ∈ ÄS provided that ÄS 6= ∅. As far as the observed feature statistics {µ(α) obs , α = 1, . . . , K} are concerned, all the distributions in ÄS explain them equally well, and they are not distinguishable from f (I). The ME principle (Jaynes, 1957) suggests that we should choose p(I) that achieves the maximum entropy to obtain the purest and simplest fusion of the observed features and their statistics. The underlying philosophy is that while p(I) satisfies the constraints along some dimensions, it should be made as random (or smooth) as possible in other unconstrained dimensions, that is, p(I) should represent information no more than that is available and in this sense, the ME principle is often called the minimum prejudice principle. Thus we have the following constrained optimization problem, ½ Z ¾ p(I) = arg max − p(I) log p(I)dI , (2.2) subject to Ep [φ (α) (I)] = and
Z
φ (α) (I)p(I)dI = µ(α) obs ,
α = 1, . . . , K,
Z p(I)dI = 1.
By an application of the Lagrange multipliers, it is well known that the solution for p(I) has the following Gibbs distribution form: ( ) K X 1 exp − hλ(α) , φ (α) (I)i , (2.3) p(I; 3, S) = Z(3) α=1 where 3 = (λ(1) , λ(2) , . . . , λ(K) ) is the parameter, λ(α) is a vector of the same dimension as φ (α) (I), h· , ·i denotes inner product, and ) ( Z K X (α) (α) hλ , φ (I)i dI Z(3) = exp − α=1
1634
Song Chun Zhu, Ying Nian Wu, and David Mumford
is the partition function, which normalizes p(I; 3) into a probability distribution. 2.2 Estimation and Computation. Equation 2.3 specifies an exponential family of distributions (Brown, 1986), 2S = {p(I; 3, S) : 3 ∈ Rd },
(2.4)
ˆ which where d is the total number of parameters, and 3 is solved at 3, ˆ satisfies the constraints p(I; 3, S) ∈ ÄS , that is, [φ (α) (I)] = µ(α) Ep(I;3,S) ˆ obs . α = 1, . . . , K.
(2.5)
However, analytical solution of equation 2.5 is in general unavailable; inˆ S) iteratively from 2S by maximum likelihood stead, we solve for p(I; 3, estimator. 1 PM obs Let L(3, S) = M i=1 log p(Ii ; 3, S) be the log-likelihood function for any p(I; 3, S) ∈ 2S , and it has the following properties: 1 ∂Z ∂L(3, S)) (α) (α) =− − µ(α) ∀α, obs = Ep(I;3,S) [φ ] − µobs , ∂λ(α) Z ∂λ(α) ∂ 2 L(3, S) (β) 0 (α) (β) (I) − µ(α) (I) − µobs ) ], ∀α, β. 0 = Ep(I;3,S) [(φ obs )(φ ∂λ(α) λ(β)
(2.6) (2.7)
Following equation 2.6, maximizing the log likelihood by gradient ascent gives the following equation for solving 3 iteratively: dλ(α) = Ep(I;3,S) [φ (α) (I)] − µ(α) obs , α = 1, . . . , K. dt
(2.8)
ˆ Moreover, equation 2.7 means Obviously equation 2.8 converges to 3 = 3. that the Hessian matrix of L(3, S) is the covariance matrix (φ (1) P(I), . . . , φ (K) (I)) and thus is positive definite under the condition that a(0) + Kα=1 a(α) φ (α) (I) ≡ 0 H⇒ a(α) = 0 for α = 0, . . . , K, which is usually satisfied. So L(3, S) is strictly concave with respect to 3, and the solution for 3 uniquely exists. Following equations 2.5, 2.6, and 2.7, we have the following conclusion. ˆ S)} where 3 ˆ is both Proposition 1. Given a feature set S, ÄS ∩ 2S = {p(I; 3, the maximum entropy estimator and the maximum likelihood estimator. At each step t of equation 2.8, the computation of Ep(I;3,S) [φ (α) (I)] is in general difficult, and we adopt the stochastic gradient method (Younes 1988) for approximation. For a fixed 3, we synthesize some typical images
Minimax Entropy Principle
1635
syn
{Ii , i = 1, . . . , M0 } by sampling p(I; 3, S) with the Gibbs sampler (Geman & Geman, 1984) or other Markov chain Monte Carlo (MCMC) methods (Winkler, 1995), and approximate Ep(I;3,S) [φ (α) (I)] by the sample means; that is, Ep(I;3,S) [φ
(α)
(I)] ≈
µ(α) obs (3)
M0 1 X syn = 0 φ (α) (Ii ), α = 1, . . . , K. M i=1
(2.9)
Therefore the iterative equation for computing 3 becomes dλ(α) (α) = 1(α) (3) = µ(α) syn (3) − µobs , α = 1, . . . , K. dt
(2.10)
For the accuracy of the approximation in equation 2.9, the sample size M0 should be large enough. The data flow for parameter estimation is shown in Figure 2, and the details of the algorithm can be found in (Zhu, Wu, & Mumford, 1996). 2.3 The Minimum Entropy Principle. For now, suppose that the sample size M is large enough so that the expected feature statistics {E f [φ (α) (I)], α = 1, . . . , K} can be estimated exactly by neglecting the estimation errors in the ? observed statistics {µ(α) obs , α = 1, . . . , K}. Then an ME distribution p(I; 3 , S) is computed so that it reproduces the expected statistics of a feature set S = {φ (α) , α = 1, 2, . . . , K}; that is, Ep(I;3? ,S) [φ (α) (I)] = E f [φ (α) (I)], α = 1, . . . , K. Since our goal is to make an inference about the underlying distribution f (I), the goodness of this model can be measured by the Kullback-Leibler (Kullback & Leibler, 1951) divergence from f (I) to p(I; 3? , S), Z f (I) KL( f, p(I; 3? , S)) = dI f (I) log p(I; 3? , S) = E f [log f (I)] − E f [log p(I; 3? , S)]. For KL( f, p(I; 3? , S)), we have the following conclusion: Theorem 1. entropy( f ).
In the above notation, KL( f, p(I; 3? , S)) = entropy(p(I; 3? , S)) −
See the appendix for a proof. In the above result, entropy( f ) is fixed, and entropy(p(I; 3? , S)) depends on the set of features S included in the distribution p(I; 3? , S). Thus minimizing KL( f, p(I; 3? , S)) is equivalent to minimizing the entropy of p(I; 3? , S). We call this the minimum entropy principle, and it has the following intuitive interpretations. First, in information theory, p(I; 3? , S) defines a coding scheme with each I assigned a coding length − log p(I; 3? , S) (Shannon, 1948), and entropy (p(I; 3? , S)) = Ep [− log p(I; 3? , S)] stands for the
1636
Song Chun Zhu, Ying Nian Wu, and David Mumford obs
Ii
9 B
=S
=
f
( ) (I) g
f
?
Ef [( ) (I)]
d(( ) )
6
Ep [( ) (I)]
6
B
=S
=
y
f
=
S
?
k) (obs
?
?
?
D(1)
D(2)
6
6
:::
image analysis
-L
D(k)
= (
(1)
6 image synthesis
; (2) ; : : : (k) )
Ep [(2) (I)] : : : Ep [(k) (I)]
6
(2)
6 estimate
(k )
syn
syn
:::
syn
(1) (I)
(2) (I)
:::
(k) (I) g
6
( ) (I) g
:::
(k) (I) g
? estimate
( 1)
6
:::
?
6
( )
the true distribution
Ef [(2) (I)] : : : Ef [(k) (I)]
Ep [(1) (I)]
syn
f
?
2) (obs
?
f ( I)
sampling
z
(2) (I)
Ef [(1) (I)]
?
(1) (I) 1) (obs
?
R
?
) (obs
i=1;2;::: M
6
I
syn
Ii
i=1;2;:::M 0
6
:
=
S
?
p(I; L ) sampling by MCMC
the model
Figure 2: Data flow of the algorithm for model estimation and feature selection.
expected coding length. Therefore, a minimum entropy principle chooses the coding system with the shortest average coding length. The shortest ˆ S) is minus average coding length in the actually estimated model p(I; 3, its log likelihood in view of Proposition 2; hence minimizing entropy is the same as maximizing the likelihood of the data: ˆ S), then L(3, ˆ S) = −entropy Proposition 2. Given a feature set S and p(I; 3, ˆ ˆ (p(I; 3, S)) where 3 is the ML estimator. Proof. Since (α) [φ (α) (I)] = µ(α) ∈ S, Ep(I;3,S) ˆ obs , ∀φ
) ( M K X 1 X (α) (α) obs ˆ − ˆ S) = hλˆ , φ (Ii )i − log Z(3) L(3, M i=1 α=1
Minimax Entropy Principle
ˆ − = − log Z(3)
1637 K X α=1
ˆ − = − log Z(3)
K X α=1
hλˆ (α) , µ(α) obs i (α) hλˆ (α) , Ep(I;3) ˆ [φ (I)]i
ˆ S)). = −entropy(p(I; 3, However, to keep the model complexity under check, one often needs to fix the number of features K. To be precise, let B be the set of all possible features and S ⊂ B an arbitrary set of K features. Therefore entropy minimization provides a criterion for choosing the optimal set of features; that is, S∗ = arg min entropy(p(I; 3? , S)). |S|=K
(2.11)
According to the maximum entropy principle, p(I; 3? , S) = arg max entropy(p). p∈ÄS
(2.12)
Combining equations 2.11 and 2.12, we have S∗ = arg min {max entropy(p)}. |S|=K p∈ÄS
(2.13)
We call equation 2.13 the minimax entropy principle. We have demonstrated that this principle is consistent with the goal of modeling: Finding the best estimate for the underlying distribution f (I), and the relationship between minimax entropy and maximum likelihood estimator is addressed by Propositions 1 and 2. 2.4 Feature Pursuit. Enumerating all possible sets of features S ⊂ B and comparing their entropies is certainly impractical. Instead, we propose a greedy procedure to pursue the features in the following way.1 Start from an empty feature set ∅ and p(I) a uniform distribution, add to the model one feature at a time such that the added feature leads to the maximum decrease in the entropy of ME model p(I; 3? , S), and keep doing this until the entropy decrease is smaller than a certain value. To be precise, let S = {φ (α) , α = 1, . . . , K} be the currently selected set of features, and let ( ) K X 1 (α) (α) exp − hλ , φ (I)i p = p(I; 3, S) = Z(3) α=1
(2.14)
1 We use the word pursuit to represent the stepwise method and distinguish it from selection.
1638
Song Chun Zhu, Ying Nian Wu, and David Mumford
be the ME distribution fitted to f (I) (we omit ? from 3 for notational simplicity in this subsection). For any new feature φ (β) ∈ B /S, let S+ = S∪{φ (β) } be a new feature set. The new ME distribution is p+ = p(I; 3+ , S+ ) ( ) K X 1 (β) (α) (α) (β) = exp − hλ+ , φ (I)i − hλ+ , φ (I)i . Z(3+ ) α=1
(2.15)
(α) Ep+ [φ (α) (I)] = E f [φ (α) (I)] for α = 1, 2, . . . , K, β, and in general, λ(α) + 6= λ for α = 1, . . . , K. According to the above discussion, we choose feature φ (K+1) to maximize the entropy decrease over the remaining features; that is,
φ (K+1) = arg max d(φ (β) ), φ (β) ∈B/S
where d(φ (β) ) = KL( f, p) − KL( f, p+ ) = entropy(p) − entropy(p+ ) = KL(p+ , p) is the entropy decrease. Let 8(I) = (φ (1) (I), . . . , φ (K) (I)), since Ep+ [8(I)] = Ep [8(I)] = E f [8(I)], d(φ (β) ) is a function of the difference between p+ and p on feature φ (β) . By second-order Taylor expansion, d(φ (β) ) can be expressed in a quadratic form. Proposition 3. d(φ (β) ) =
In the above notation, 1 0 (Ep [φ (β) (I)] − E f [φ (β) (I)]) 2 (β) × Vp−1 (I)] − E f [φ (β) (I)]), 0 (Ep [φ
(2.16)
where p0 is a distribution such that Ep0 [8(I)] = E f [8(I)], and Ep0 [φ (β) (I)] lies −1 V12 , where V11 = between Ep [φ (β) (I)] and E f [φ (β) (I)]. Vp0 = V22 − V21 V11 0
Varp0 [8(I)], V22 = Varp0 [φ (β) (I)], V12 = Covp0 [8(I), φ (β) (I)], and V21 = V12 . See the appendix for proof.
(β) −1 , and let φ⊥ (I) = The Vp0 can be interpreted as follows. Let C = −V12 V11 (β) (β) φ (I)+C8(I) be the linear combination of φ (I) and 8(I); then under p0 , it (β)
(β)
can be shown that φ⊥ (I) is uncorrelated with 8(I), thus Vp0 = Varp0 [φ⊥ (I)] is the variance of φ (β) (I) with its dependence on 8(I) being eliminated. (β) In practice, E f [φ (β) (I)] is estimated by the observed statistic µobs , and (β)
Ep [φ (β) (I)] by µsyn —the sample mean computed from synthesized images
Minimax Entropy Principle
1639
sampled from the current model p. If the intermediate distribution p0 is approximated using the current distribution p, the distance d(φ (β) ) is approximated by d(φ (β) ) ≈
´0 ³ ´ 1 ³ (β) (β) −1 (β) µobs − µ(β) V − µ µ syn p syn , obs 2
(2.17)
(β)
where Vp = Varp (φ⊥ ) is the variance estimated from the synthesized images. We will further study the estimation of Vp in section 3.2. The feature pursuit procedure governed by equation 2.17 has the following intuitive interpretation. Under the current model p, for any new feature (β) φ (β) , µsyn is the statistic we observe from the image samples following p. If (β) (β) µsyn is close to µobs , then adding this new feature to p(I; 3) leads to little improvement in estimating f (I). So we should look for the most salient new (β) (β) feature φ (β) such that µsyn is very different from µobs . The saliency of the (β) new feature is measured by d(φ (β) ), which is the discrepancy between µsyn (β) and µobs scaled by Vp , where Vp is the variance of the new feature compensated for dependence of the new feature on the old ones under the current model. As a summary, Figure 2 illustrates the data flow for both the computation of the model and the pursuit of features. 3 More on Minimax Entropy 3.1 Correcting the Minimum Entropy Principle. In previous sections, for a set of features S = {φ (α) , α = 1, . . . , K}, we have studied two ME distriˆ S), which reproduces the observed feature statistics, butions. One is p(I; 3, that is, [φ (α) (I)] = µ(α) Ep(I;3,S) ˆ obs ,
for α = 1, . . . , K,
and the other is p(I; 3? , S), which reproduces the expected feature statistics, that is, Ep(I;3? ,S) [φ (α) (I)] = E f [φ (α) (I)],
for α = 1, . . . , K.
In the previous derivations, we assume that {E f [φ (α) (I)], α = 1, . . . , K} can be estimated exactly by the observed statistics {µ(α) obs , α = 1, . . . , K}, which is not true in practice since only a finite sample is observed. Taking the estimation errors into account, we need to correct the minimum entropy principle and the feature pursuit procedure. First, let us consider the minimum entropy principle, which relates the Kullback-Leibler divergence KL( f, p(I; 3, S)) to the entropy of the model ˆ the goodp(I; 3, S) for 3 = 3? . Since in practice 3 is estimated at 3, ˆ S)) instead ness of the actual model should be measured by KL( f, p(I; 3, of KL( f, p(I; 3? , S)), for which we have:
1640
Song Chun Zhu, Ying Nian Wu, and David Mumford
Proposition 4.
In the above notation,
ˆ S)) = KL( f, p(I; 3? , S)) + KL(p(I; 3? , S), p(I; 3, ˆ S)). (3.1) KL( f, p(I; 3, See the appendix for proof. ˆ S) does not come as close That is, because of the estimation error, p(I; 3, to f (I) as p(I; 3? , S) does, and the extra noise is measured by KL(p(I; 3? , S), ˆ S)). In fact, 3 ˆ in model p(I; 3, ˆ S) is a random variable depending p(I; 3, obs ˆ S)). Let Eobs on the random sample {Ii , i = 1, . . . , M}; so is KL( f, p(I; 3, stands for the expectation with respect to the training images. Applying Eobs to both sides of equation 3.1, we have, ˆ S))] Eobs [KL( f, p(I; 3, ˆ S))] = KL( f, p(I; 3? , S)) + Eobs [KL(p(I; 3? , S), p(I; 3, = entropy(p(I; 3? , S)) − entropy( f ) ˆ S))]. + Eobs [KL(p(I; 3? , S), p(I; 3,
(3.2)
The following proposition relates entropy(p(I; Λ*, S)) to entropy(p(I; Λ̂, S)).

Proposition 5.
In the above notation,
entropy(p(I; Λ*, S)) = E_obs[entropy(p(I; Λ̂, S))] + E_obs[KL(p(I; Λ̂, S), p(I; Λ*, S))].
(3.3)
See the appendix for proof.

According to Proposition 5, the entropy of p(I; Λ̂, S) is on average smaller than the entropy of p(I; Λ*, S); this is because Λ̂ is estimated from specific training data, and hence p(I; Λ̂, S) does a better job than p(I; Λ*, S) in fitting the training data. Combining equations 3.2 and 3.3, we have

E_obs[KL(f, p(I; Λ̂, S))] = E_obs[entropy(p(I; Λ̂, S))] − entropy(f) + C1 + C2,
(3.4)
where the two correction terms are

C1 = E_obs[KL(p(I; Λ*, S), p(I; Λ̂, S))],
C2 = E_obs[KL(p(I; Λ̂, S), p(I; Λ*, S))].

Following Ripley (1996, sec. 2.2), both C1 and C2 can be approximated by

(1/2M) tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) + O(M^{−3/2}),
where tr( ) is the trace of a matrix. Therefore, we arrive at the following form of the Akaike information criterion (Akaike, 1977):

E_obs[KL(f, p(I; Λ̂, S))] ≈ E_obs[entropy(p(I; Λ̂, S))] − entropy(f) + (1/M) tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]),

where we drop the higher-order term O(M^{−3/2}). The optimal set of features should be chosen to minimize E_obs[KL(f, p(I; Λ̂, S))], which leads to the following correction of the minimum entropy principle:

S* = arg min_{|S|=K} { entropy(p(I; Λ̂, S)) + (1/M) tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) }.   (3.5)
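The trace penalty in equation 3.5 is directly computable from samples. A minimal sketch, assuming Var_f and Var_{p*} are replaced by the sample covariances of observed and synthesized feature vectors (the approximation the text introduces next):

```python
import numpy as np

def complexity_penalty(phi_obs, phi_syn):
    """(1/M) tr(Var_f[Phi] Var_p*^{-1}[Phi]) estimated from samples.
    phi_obs: (M, K) observed feature vectors; phi_syn: (M', K) synthesized.
    If the two covariances agree, the trace is roughly K, the number of
    free parameters, matching the reading given in the text."""
    M = phi_obs.shape[0]
    var_f = np.atleast_2d(np.cov(phi_obs, rowvar=False))
    var_p = np.atleast_2d(np.cov(phi_syn, rowvar=False))
    return np.trace(var_f @ np.linalg.inv(var_p)) / M
```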
In practice, Var_f[Φ(I)] and Var_{p*}[Φ(I)] can be estimated from the observed images and synthesized images, respectively. If Var_f[Φ(I)] ≈ Var_{p*}[Φ(I)], then tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) is approximately the number of free parameters in the model. This provides another reason for restricting the model complexity, besides scientific parsimony and computational efficiency. Another perspective on this issue is the minimum description length (MDL) principle (Rissanen, 1989).

Now let us consider correcting the feature pursuit procedure. Following the notation in section 2.4, at each step K + 1, suppose we choose a new feature φ^(β), and let Φ+(I) = (Φ(I), φ^(β)(I)); the decrease of the expected Kullback-Leibler divergence is:

E_obs[KL(f, p)] − E_obs[KL(f, p+)] = d(φ^(β)) − (1/M)[tr(Var_f[Φ+(I)] Var_{p*+}^{−1}[Φ+(I)]) − tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)])].

By linear algebra, we can show that

tr(Var_f[Φ+(I)] Var_{p*+}^{−1}[Φ+(I)]) − tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) = tr(Var_f[φ⊥^(β)(I)] Var_{p*+}^{−1}[φ⊥^(β)(I)]).   (3.6)
See the appendix for the proof of equation 3.6. Therefore, at every step of the (corrected) feature pursuit procedure, we should choose φ^(β) to maximize

d′(φ^(β)) = d(φ^(β)) − (1/M) tr(Var_f[φ⊥^(β)] Var_{p*+}^{−1}[φ⊥^(β)]).

In practice, we approximate Var_{p*+}[φ⊥^(β)] by Var_p[φ⊥^(β)], and estimate the variances from the observed and synthesized images. Let µ_⊥obs^(β) and V̂_obs be
the sample mean and variance of {φ⊥^(β)(I_i^obs), i = 1, 2, . . . , M}, and let V̂_syn be the sample variance of {φ⊥^(β)(I_i^syn), i = 1, 2, . . . , M′}. Thus we have

d′(φ^(β)) ≈ (1/2) (µ_syn^(β) − µ_obs^(β))′ V̂_syn^{−1} (µ_syn^(β) − µ_obs^(β)) − (1/M) tr(V̂_obs V̂_syn^{−1}).   (3.7)
We note that in equation 3.7,

tr(V̂_obs V̂_syn^{−1}) = tr( (1/M) Σ_{i=1}^M (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β)) (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β))′ V̂_syn^{−1} )
= (1/M) Σ_{i=1}^M (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β))′ V̂_syn^{−1} (φ⊥^(β)(I_i^obs) − µ_⊥obs^(β))

is a measure of fluctuation in the observed images.

The intuitive meaning of equation 3.7 is the following. The first term is the distance between µ_syn^(β) and µ_obs^(β), and we call it the gain from introducing a new feature φ^(β). The second term measures the uncertainty in estimating E_f[φ^(β)(I)], and we call it the loss from adding φ^(β). If the loss term is large, it means the feature is less common to the observed images; thus d′(φ^(β)) is small. When µ_syn^(β) comes very close to µ_obs^(β), d′(φ^(β)) becomes negative, which provides a criterion for stopping the iteration in computing Λ in equation 2.10.

3.2 Variance Estimation in Homogeneous Random Fields. In previous sections, we assume that we have M independent observations I_i^obs, i = 1, 2, . . . , M, each of the same size as the image domain D (N × N pixels). A feature φ^(β)(I_i^obs) is computed based on the intensities of an entire image, and the sample mean and variance are then computed from φ^(β)(I_i^obs), i = 1, 2, . . . , M. The same is true for synthesized images. However, in many applications, such as the texture modeling in the next section, it is assumed that the underlying distribution f(I) is ergodic and images I are homogeneous; thus, φ^(β)(I) is often expressed as the average of local features ψ^(β)( ):
φ^(β)(I) = (1/|D|) Σ_{v∈D} ψ^(β)(I|_{W+v}),
where ψ^(β) is a function defined on locally supported (filter) windows W centered at v ∈ D. Therefore, by ergodicity, we can still estimate E_f[φ^(β)(I)] and Var_f[φ^(β)(I)] reasonably well even though only a single image is observed, provided that the observed image is large enough compared to the strength of autocorrelation.

In particular, we adopt the method recently proposed by Sherman (1996) for the estimation of Var_f[φ^(β)(I)]. To fix notation, suppose we observe one
image I^obs on an N_o × N_o lattice D^obs, and we define subdomains D_m^j ⊂ D^obs, j = 1, 2, . . . , ℓ. For simplicity, we assume all subdomains are square image patches of the same size of m × m pixels; m is usually chosen at the scale of √N_o, and the subdomains may overlap each other. Then for each subdomain D_m^j, we compute φ^(β)(D_m^j) = (1/m²) Σ_{v∈D_m^j} ψ^(β)(I|_{W+v}), and the sample mean and variance are computed over the subdomains:

µ_obs^(β)(D_m) = (1/ℓ) Σ_{j=1}^ℓ φ^(β)(D_m^j),

Var_obs^(β)(D_m) = (1/ℓ) Σ_{j=1}^ℓ (φ^(β)(D_m^j) − µ_obs^(β)(D_m)) (φ^(β)(D_m^j) − µ_obs^(β)(D_m))′.
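These subdomain statistics are simple to compute; the sketch below uses a placeholder local feature ψ (a gradient-energy average) and possibly overlapping square patches, both illustrative choices rather than the paper's.

```python
import numpy as np

def subdomain_stats(image, m, stride):
    """Sample mean and variance of phi(D_m^j) over square m-x-m subdomains
    of one observed image, as in the two displays above."""
    def patch_feature(p):                      # stands in for (1/m^2) sum psi
        return np.array([np.mean(np.abs(np.diff(p, axis=0)))])
    H, W = image.shape
    feats = np.array([patch_feature(image[r:r + m, c:c + m])
                      for r in range(0, H - m + 1, stride)
                      for c in range(0, W - m + 1, stride)])   # (l, dim)
    mu = feats.mean(axis=0)
    var = np.atleast_2d(np.cov(feats, rowvar=False))
    return mu, var
```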
Then, according to Sherman (1996), Var_f[φ^(β)(I)] can be estimated by (m²/N²) Var_obs^(β)(D_m).

Now let us consider the feature pursuit criterion in equation 3.7. For feature φ⊥^(β) we define the variance V̂_obs(D₁) = m² V̂_obs(D_m), where V̂_obs(D_m) is the sample variance of φ⊥^(β)(D_m^j), j = 1, 2, . . . , ℓ. Then from the above result, we can approximate V̂_obs in equation 3.7 by V̂_obs(D₁)/N². Similarly, V̂_syn in equation 3.7 is replaced by V̂_syn(D₁)/N². Thus we have
d′(φ^(β)) ≈ N² · [ (1/2) (µ_syn^(β) − µ_obs^(β))′ V̂_syn^{−1}(D₁) (µ_syn^(β) − µ_obs^(β)) − (1/c) tr(V̂_obs(D₁) V̂_syn^{−1}(D₁)) ],   (3.8)
where c is |D^obs| minus the number of pixels around the boundary. From equation 3.8, we notice that d′(φ^(β)) is proportional to N², the size of the domain D. A more rigorous study is often complicated by phase transition, and we shall not pursue it in this article.

4 Application to Texture Modeling

This section applies the minimax entropy principle to texture modeling.

4.1 The General Problem. Texture is an important characteristic of surface properties in visual scenes and a powerful cue in visual perception. A general model for textures has long been sought in both computational vision and psychology, but such a model is still far from being achieved because
of the vast diversity of the physical and chemical processes that generate textures and the large number of attributes that need to be considered. As an illustration of this diversity, Figure 3 displays some typical texture images. Existing models for textures can be roughly classified into three categories: (1) dynamic equations or replacement rules, which simulate specific physical and chemical processes to generate textures (Witkin & Kass, 1991; Picard, 1996); (2) the kth-order statistics model for texture perception, that is, the famous Julesz conjecture (Julesz, 1962); and (3) MRF models. (For a discussion of previous models and methods, see Zhu et al., 1996.)

In our method, a texture is considered an ensemble of images of similar texture appearance governed by a probability distribution f(I). As discussed in section 2, we seek a model p(I; Λ, S) given a set of observed images. p(I; Λ, S) should be consistent with human texture perception in the sense that if p(I; Λ, S) estimates f(I) closely, the images sampled from p(I; Λ, S) should be perceptually similar to the training images.

4.2 Choosing Features and Their Statistics. As the first step of applying the minimax entropy principle, we need to choose image features and their statistics, that is, φ^(α)(I) and µ_obs^(α), α = 1, 2, . . . , K.

First, we limit our model to homogeneous textures; thus f(I) is stationary with respect to location v. We assume that features of texture images can be extracted by "filters" F^(α), α = 1, 2, . . . , K, where F^(α) can be a linear or nonlinear function of the intensities of the image I. Let I^(α)(v) denote the filter response at point v ∈ D; that is, I^(α)(v) = F^(α)(I|_{W+v}) is a function depending on the intensities inside the window W centered at v.

Second, recent psychophysical research on human texture perception suggests that two homogeneous textures are often difficult to discriminate when they produce similar marginal distributions (histograms) of responses from a bank of filters (Bergen & Adelson, 1991; Chubb & Landy, 1991). Motivated by this research, we make the following assumptions to limit the number of filters and the window size of each filter for computational reasons, though these assumptions are not necessary conditions for our theory to hold true:

1. All features that concern texture perception can be captured by "locally" supported filters. By "locally" we mean that the sizes of the filters should be much smaller than the size of the image. For example, if the image is 256 × 256 pixels, the window sizes of the filters are limited to be less than 33 × 33 pixels.

2. Only a finite set of filters is used.

Given a filter F^(α), we compute the histogram of the filtered image I^(α) as the features of I. Therefore, in texture modeling, the notation φ^(α)(I) is
Figure 3: Some typical texture images.
replaced by

H^(α)(I, z) = (1/|D|) Σ_{v∈D} δ(z − I^(α)(v)),   α = 1, 2, . . . , K,  z ∈ R,

where δ( ) is the Dirac point mass function concentrated at 0. Correspondingly, the observed statistics µ_obs^(α) are defined as

µ_obs^(α)(z) = (1/M) Σ_{i=1}^M H^(α)(I_i^obs, z),   α = 1, 2, . . . , K.
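In code, these features amount to filtering, binning, and averaging. A minimal sketch (L bins over a fixed response range, which in practice would be set from the data; `scipy.signal.convolve2d` performs the linear filtering):

```python
import numpy as np
from scipy.signal import convolve2d

def filter_histogram(image, kernel, L=32, lo=-1.0, hi=1.0):
    """H^(alpha)(I): normalized L-bin histogram of the filtered image,
    the piecewise-constant stand-in for the continuous marginal."""
    resp = convolve2d(image, kernel, mode="same", boundary="symm")
    hist, _ = np.histogram(resp, bins=L, range=(lo, hi))
    return hist / resp.size

def observed_statistic(images, kernel, L=32):
    """mu_obs^(alpha): the histogram averaged over the M observed images."""
    return np.mean([filter_histogram(im, kernel, L) for im in images], axis=0)
```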
(Footnote 2: H^(α)(I, z) and µ_obs^(α)(z) are, in theory, continuous functions of z. In practice, they are approximated by piecewise constant functions over a finite number L of bins, and are denoted by H^(α)(I) and µ_obs^(α) as L-dimensional (e.g., L = 32) vectors in the rest of the article.)

As the sample size M becomes large, or the images I_i^obs are large so that the large-sample effect takes place by ergodicity, µ_obs^(α)(z) will be a close estimate of the marginal distribution of f(I):
f^(α)(z) = E_f[H^(α)(I, z)].

Another motivation for choosing µ_obs^(α)(z) as the feature statistics comes from a mathematical theorem, which states that f(I) is determined by all its marginal distributions f^(α)(z). Thus, if a model p(I) reproduces f^(α)(z) for all α, then p(I) = f(I) (Zhu et al., 1996).

Substituting H^(α)(I) for φ^(α)(I) in equation 2.3, we obtain

p(I; Λ, S) = (1/Z(Λ)) exp{ −Σ_{α=1}^K ⟨λ^(α), H^(α)(I)⟩ },
(4.1)
which we call the FRAME model. Here the angle brackets indicate a sum over bins z: that is, ⟨λ^(α), H^(α)(I)⟩ = Σ_z λ_z^(α) H^(α)(I, z). The computation of the parameters Λ and the selection of the filters F^(α) proceed as described in the last section. For a detailed analysis of the texture modeling algorithm, see Zhu et al. (1996).

4.3 FRAME: A New Class of MRF Models. In this section, we derive a continuous form for the FRAME model in equation 4.1 and compare it with existing MRF models. Compared with the definitions of φ^(α)(I) and µ_obs^(α), H^(α)(I, z) and µ_obs^(α)(z) are considered vectors of infinite dimensions.
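Before passing to the continuous form, note that up to the intractable log Z(Λ) the discrete model in equation 4.1 is log-linear in the histograms, so its unnormalized log-probability is a sum of inner products. A sketch, reusing `filter_histogram` from the earlier sketch:

```python
import numpy as np

def frame_log_prob_unnorm(image, kernels, lambdas, L=32):
    """Unnormalized log p(I; Lambda, S) = -sum_alpha <lambda^(alpha), H^(alpha)(I)>.
    log Z(Lambda) is omitted; it is intractable and cancels in the
    acceptance ratios of the MCMC samplers used to draw from the model."""
    return -sum(np.dot(lam, filter_histogram(image, k, L))
                for k, lam in zip(kernels, lambdas))
```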
Since the histograms of an image are continuous functions, the constraint in the ME optimization problem is the following:

E_p[ (1/|D|) Σ_{v∈D} δ(z − I^(α)(v)) ] = µ_obs^(α)(z),   ∀z ∈ R, ∀v ∈ D, ∀α.   (4.2)

By an application of Lagrange multipliers, maximizing the entropy of p(I) under the above constraints gives

p(I; Λ, S) = (1/Z(Λ)) exp{ −Σ_{α=1}^K ∫ λ^(α)(z) (1/|D|) Σ_{v∈D} δ(z − I^(α)(v)) dz }
= (1/Z(Λ)) exp{ −Σ_{α=1}^K Σ_{v∈D} λ^(α)(I^(α)(v)) }.   (4.3)

Since z is a continuous variable, there is an infinite number of constraints. The Lagrange multipliers Λ = (λ^(1)( ), . . . , λ^(K)( )) take the form of one-dimensional potential functions. More specifically, when the filters are linear, I^(α)(v) = F^(α) * I(v), and we can rewrite equation 4.3 as

p(I; Λ, S) = (1/Z(Λ)) exp{ −Σ_{α=1}^K Σ_v λ^(α)(F^(α) * I(v)) }.   (4.4)
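For linear filters, equation 4.4 gives an energy that is a learned potential applied pointwise to each filter response and summed over locations; a minimal sketch (potentials passed as elementwise callables, an assumption for illustration):

```python
import numpy as np
from scipy.signal import convolve2d

def gibbs_energy(image, kernels, potentials):
    """Energy of equation 4.4: sum_alpha sum_v lambda^(alpha)(F^(alpha) * I(v)),
    so that p(I) ∝ exp(-energy).  Each potential lambda^(alpha) is a callable
    applied elementwise to the filtered image."""
    return sum(float(np.sum(pot(convolve2d(image, k, mode="same",
                                           boundary="symm"))))
               for k, pot in zip(kernels, potentials))
```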
Clearly, equations 4.3 and 4.4 are MRF models or, equivalently, Gibbs distributions. But unlike previous MRF models, the potentials are built directly on the filter responses instead of cliques, and the forms of the potential functions λ^(α)( ) are learned from the training images, so they can incorporate high-order statistics and thus model nongaussian properties of images. The FRAME model has much stronger expressive power than traditional clique-based MRF models. Every filter introduces the same number L of parameters regardless of its window size, which enables us to explore structures at large scales (e.g., the 33 × 33 pixel filters used in modeling the fabric texture in section 4.5). It is easy to show that existing MRF models for texture are special cases of the FRAME model with the filters and their potential functions specified. A detailed comparison between the FRAME model and the MRF models is given in Zhu et al. (1996).

4.4 Designing a Filter Bank. To describe a wide variety of textures, we need to specify a general filter bank, which serves as the "vocabulary" by analogy to language. We shall not discuss the rules for constructing an optimal filter bank; instead, we use the following five kinds of filters, motivated
by the multichannel filtering mechanism discovered and generally accepted in neurophysiology (Silverman, Grosof, De Valois, & Elfar, 1989).

1. The intensity filter, δ( ), for capturing the DC component.

2. The Laplacian of gaussian filters, which are isotropic center-surround filters and are often used to model retinal ganglion cells. The impulse response functions are of the following form:

LG(x, y | T) = const · (x² + y² − T²) e^{−(x² + y²)/T²}.   (4.5)

We choose eight scales with T = √2/2, 1, 2, 3, 4, 5, and 6. The filter with scale T is denoted by LG(T).
3. The Gabor filters, which are models for the frequency- and orientation-sensitive simple cells. The impulse response functions are of the following form:

Gabor(x, y | T, θ) = const · e^{−(1/(2T²))(4(x cos θ + y sin θ)² + (−x sin θ + y cos θ)²)} · e^{−i(2π/T)(x cos θ + y sin θ)},   (4.6)
where T controls the scale and θ the orientation. We choose six scales, T = 2, 4, 6, 8, 10, and 12, and six orientations, θ = 0°, 30°, 60°, 90°, 120°, and 150°. Notice that these filters are not nearly orthogonal to each other, so there is overlap among the information they capture. The sine and cosine components are denoted by G sin(T, θ) and G cos(T, θ), respectively.

4. The nonlinear Gabor filters, which are models for the complex cells; their responses are the powers of the responses from a pair of Gabor filters, |Gabor(x, y | T, θ) * I|². This filter, denoted by SP(T, θ), is, in fact, the local spectrum of I at (x, y) smoothed by a gaussian function.

5. Some specially designed filters for texton primitives (see section 4.5).

4.5 Experiments of Texture Modeling. This section describes the modeling of natural textures using the algorithm studied in sections 2 and 3. The first texture image is described in detail to illustrate the filter pursuit procedure.

Suppose we are modeling f(I), where I is of 64 × 64 pixels. Figure 4a is an observed image of animal fur (128 × 128 pixels). We start from the filter set S = ∅ and p(I; Λ, S) a uniform distribution, from which a uniform white noise image is sampled and displayed in Figure 4b (128 × 128 pixels). The algorithm first computes d′(φ^(1)) according to equations 3.7 and 3.8 for each
Table 1: The Entropy Decrease d′(φ^(β)) for Filter Pursuit.

Filter           Size    d′(φ^(1))  d′(φ^(2))  d′(φ^(3))  d′(φ^(4))  d′(φ^(5))  d′(φ^(8))
δ                1×1       1018.2      42.2      50.8      20.0      26.4     *−1.8
LG(√2/2)         3×3       4205.9     466.0     107.4     172.9      41.6      22.6
LG(1)            5×5      [4492.3]       —         —         —         —      *−1.4
LG(2)            9×9         20.2     465.7     159.3      24.5       6.3      18.5
G cos(2, 0°)     5×5       3140.8     188.3     140.4     137.0    [135.4]    *−3.2
G cos(2, 30°)    5×5       4240.3     668.0     307.6    [317.8]       —      *−1.9
G cos(2, 60°)    5×5       3548.8     124.6      25.1      21.9      14.2       7.5
G cos(2, 90°)    5×5       1063.3      62.1      38.1      90.3      40.7       1.1
G cos(2, 120°)   5×5       1910.7      26.2       2.0       2.5      47.6      16.4
G cos(2, 150°)   5×5       3717.2     220.7     189.2     161.7       9.3      −0.8
G cos(4, 0°)     7×7        958.2      25.7      17.9       5.3       8.2       6.4
G cos(4, 30°)    7×7       2205.8     125.5      61.0      75.2      35.0       0.9
G cos(4, 60°)    7×7       1199.5      32.7      35.4      12.2      10.9       6.9
G cos(4, 90°)    7×7        108.8     229.6     130.6      20.2      31.9      30.2
G cos(4, 120°)   7×7         19.2   [1146.4]       —         —         —      *−2.7
G cos(4, 150°)   7×7        157.5     247.1      10.4     101.9      56.0       3.9
G cos(6, 0°)     11×11      102.1      12.8       4.3      −1.2      19.0       1.8
G cos(6, 30°)    11×11      217.3      54.8       8.4      32.9      11.5      −1.7
G cos(6, 60°)    11×11       85.6       4.7       0.1       4.5       3.8       6.0
G cos(6, 90°)    11×11       13.6     134.8     192.4      −0.4       7.9       1.6
G cos(6, 120°)   11×11      321.7     706.8    [640.3]       —         —      *−2.8
G cos(6, 150°)   11×11        3.8     100.1      12.7      98.6      75.1     *−1.4
G cos(8, 0°)     15×15       −1.6      11.0      −0.2       4.6       9.7      14.3
G cos(8, 30°)    15×15        2.4      33.0       2.1      13.8      12.7      −0.1
G cos(8, 60°)    15×15       10.7       5.5      −1.2       4.1       6.8       1.3
G cos(8, 90°)    15×15      203.0      51.9      71.7       3.9      12.3       6.8
G cos(8, 120°)   15×15      586.8     276.6     361.8      58.2      58.2       3.7
G cos(8, 150°)   15×15      140.1      44.6       1.3      45.5      42.5      38.0

Notes: * This filter has already been chosen; the value is computed using feature φ^(β), not φ⊥^(β). Bracketed numbers (boldface in the original) are the largest in each column and are thus the filters chosen by the algorithm.
filter, and d′(φ^(1)) for some filters is listed in Table 1. Filter LG(1) has the largest entropy decrease and thus is chosen as the first filter, S = {LG(1)}. Then a model p(I; Λ, S) is computed, and a synthesized image is shown in Figure 4c. Comparing Figure 4c with Figure 4b, it is evident that this filter captures the local smoothness of the observed texture image. Continuing the algorithm, six more filters are sequentially added: (2) G cos(4, 120°); (3) G cos(6, 120°); (4) G cos(2, 30°); (5) G cos(2, 0°); (6) G cos(6, 150°); and (7) the intensity filter δ( ). The texture images synthesized using 3, 4, and 7 filters are displayed in Figures 4d-f. Clearly, as more filters are added, the synthesized texture image gets closer to the observed one.

After choosing seven filters, the entropy decrease for all filters becomes very small; some are negative. Similar results are observed for the filters not listed in Table 1. This confirms our earlier assumption that the marginal distributions of a small
Figure 4: Synthesis of the fur texture. (a) The observed image. (b-f) The synthesized images using 0, 1, 3, 4, and 7 filters, respectively.
Figure 5: (a) The observed texture: mud. (b) The synthesized one using five filters.
Figure 6: (a) The observed texture image: cheetah blob. (b) The synthesized one using six filters.
number of filtered images should be adequate for capturing the essential features of the underlying probability distribution f(I). (Footnote 3: The synthetic fur texture in these figures is better than that in Zhu et al. (1996), since the L1 criterion used for filter pursuit there has been replaced by the criterion of equation 3.7.)

Figure 5a is a scene of mud ground with scattered animal footprints, which are filled with water and thus appear brighter. This texture image shows sparse features. Figure 5b is the synthesized texture image using five filters.

Figure 6a is an image taken from the skin of a cheetah, and Figure 6b displays the synthesized texture using six filters. Notice that the original
Figure 7: (a) The input image of fabric. (b) The synthesized image with two pairs of Gabor filters plus the Laplacian of gaussian filter. (c, d) Two more images sampled at different steps of the Gibbs sampler.
observed texture image is not homogeneous: the shapes of the blobs vary systematically with spatial location, and the upper left corner is darker than the lower right. The synthesized texture, shown in Figure 6b, also has elongated blobs introduced by the different filters, but the bright pixels seem to spread uniformly across the image due to the effect of entropy maximization.

Figure 7a shows a texture of fabric that has clear periods along both the horizontal and vertical directions. We choose two nonlinear filters: the spectrum analyzers SP(17, 0°) and SP(17, 90°), with their periods T tuned to the periods of the texture; the window sizes of the filters are 33 × 33 pixels. We also use the intensity filter δ( ) and the filter LG(√2/2) to take care of the
Figure 8: Two typical texton images of 256 × 256 pixels: (a) circle and (b) cross. (c, d) The two synthesized images of 128 × 128 pixels.
intensity histogram and the smoothness features. Three synthesized texture images are displayed in Figures 7b-d, taken at different sampling steps. This experiment shows that once the Markov chain becomes stationary, or gets close to stationary, the images sampled from p(I) always have perceptually similar appearances but different details.

Figures 8a and 8b show two special binary texture images formed from identical textons (circles and crosses), which have been studied extensively by psychologists for the purpose of understanding human texture perception. Our interest here is to see whether this class of textures can still be modeled by FRAME. We use the linear filter whose impulse response function is a mask with the corresponding texton at the center. With this filter selected, Figure 1b plots the histograms of the filtered image F * I, with I being the texton image observed in Figure 8a (solid curve) and a uniform noise image (dotted curve). Observe that there are many isolated peaks in the observed histogram, which stand for important image features. The computation of the model is complicated by the nature of such isolated peaks, and we proposed an annealing approach for computing Λ (for details see Zhu et al.,
1996). Figures 8c and 8d show two synthesized images.

5 Discussion

This article proposes a minimax entropy principle for building probability models in a variety of applications. Our theory answers two major questions. The first is feature binding or feature fusion: how to integrate image features and their statistics into a single joint probability distribution without limiting the forms of the features. The second is feature selection: how to choose a set of features that best characterizes the observed images. Algorithms are proposed for parameter estimation and stochastic simulation. A greedy algorithm is developed for feature pursuit, and the minimax entropy principle is corrected for the presence of sample variations.

As an example of its applications, we apply the minimax entropy principle to modeling textures. There are various artificial categories for textures with respect to various attributes, such as Fourier and non-Fourier, deterministic and stochastic, and macro- and microtextures. FRAME erases these artificial boundaries and characterizes all of them in a unified model with different filters and parameter values. It has been well recognized that the traditional MRF models, as special cases of FRAME, can be used to model stochastic, non-Fourier microtextures. From the textures we synthesized, it is evident that FRAME is also capable of modeling periodic and deterministic textures (fabric), textures with large-scale elements (fur and cheetah blobs), and textures with distinguishable textons (circles and cross bars).

Our method for texture modeling was inspired by and bears some similarities to the recent work by Heeger and Bergen (1995) on texture synthesis, where many natural-looking texture images are successfully synthesized by matching the histograms of filter responses organized in the form of a pyramid. Compared with Heeger and Bergen's algorithm, the FRAME model is distinctive in the following aspects. First, we obtain a probability model p(I; Λ, S) instead of merely synthesizing texture images. Second, the Markov chain Monte Carlo procedure for model estimation and texture sampling is guaranteed to converge to a stationary process that follows the estimated distribution p(I; Λ, S) (Geman & Geman, 1984), and the observed histograms can be matched closely. However, the FRAME model is computationally expensive, and approaches for further facilitating the computation are yet to be developed. For more discussion on this aspect, see Zhu et al. (1996).

Many textures are still difficult to model, such as the two human-made cloth textures shown in Figure 9. It appears that synthesizing such textures requires far more sophisticated features than those we have used in the texture modeling experiments, and these features may correspond to a high-level visual process, such as the geometric properties of object shape. In this article, we choose filters from a fixed set, but in general it is not understood how to design such a set of features or structures for arbitrary applications.
Figure 9: Two challenging texture images.
An important issue is whether the minimax entropy principle for model inference is "biologically plausible" and might be considered a model for the method used by natural intelligences in constructing models of classes of images. From a computational standpoint, the maximum entropy phase of the algorithm consists mainly of approximating the values of the Lagrange multipliers, which we have done by hill climbing with respect to the log likelihood. Specifically, we have used Monte Carlo methods to sample our distributions and plugged the sampled statistics into the gradient of the log likelihood. One of the authors has conjectured that feedback pathways in the cortex may serve the function of forming mental images on the basis of learned models of the distribution on images (Mumford, 1992). Such a mechanism might well sample by Monte Carlo as in the algorithm in this article. That theory further postulated that the cortex seeks out the "residuals," the features of the observed image that differ from those of the mental image. The algorithm shows how such residuals can be used to drive a learning process in which the Lagrange multipliers are gradually improved to increase the log likelihood. We would conjecture that these Lagrange multipliers are stored as suitable synaptic weights in the higher visual areas or in the top-down pathway. Given the massively parallel architecture, the apparent stochastic component in neural firing, and the huge number of observed images processed every day, the computational load of our algorithm may not be excessive for cortical implementation.

The minimum entropy phase of our algorithm has some direct experimental evidence in its favor. There has been extensive psychophysical experimentation on the phenomenon of preattentive texture discrimination.
We propose that textures that can be preattentively discriminated are exactly those for which suitable filters have been incorporated into a minimum entropy cortical model, and that the process by which subjects can train themselves to discriminate new sets of textures preattentively is exactly that of incorporating a new filter feature into the model. Evidence that texture pairs that are not preattentively segmentable by naive subjects become segmentable after practice has been reported by many groups, most notably by Karni and Sagi (1991). The remarkable specificity of the reported texture discrimination learning suggests that very specific new filters are incorporated into the cortical texture model, as in our theory.

Appendix: Mathematical Details

Proof of Theorem 1. Let Λ* = (λ*^(1), λ*^(2), . . . , λ*^(K)) be the parameter. By definition we have E_{p(I;Λ*,S)}[φ^(α)(I)] = E_f[φ^(α)(I)], α = 1, . . . , K. Then

E_f[log p(I; Λ*, S)] = −E_f[log Z(Λ*)] − Σ_{α=1}^K E_f[⟨λ*^(α), φ^(α)(I)⟩]
= −log Z(Λ*) − Σ_{α=1}^K ⟨λ*^(α), E_f[φ^(α)(I)]⟩
= −log Z(Λ*) − Σ_{α=1}^K ⟨λ*^(α), E_{p(I;Λ*,S)}[φ^(α)(I)]⟩
= E_{p(I;Λ*,S)}[log p(I; Λ*, S)] = −entropy(p(I; Λ*, S)),

and the result follows.

Proof of Proposition 3. Let Φ(I) = (φ^(1)(I), . . . , φ^(K)(I)) and Φ+(I) = (Φ(I), φ^(β)(I)). We have the entropy decrease

d(φ^(β)) = KL(p+; p)
= (1/2) (E_p[Φ+(I)] − E_{p+}[Φ+(I)])′ Var_{p′}[Φ+(I)]^{−1} (E_p[Φ+(I)] − E_{p+}[Φ+(I)])   (A.1)
= (1/2) (E_p[φ^(β)(I)] − E_f[φ^(β)(I)])′ V_{p′}^{−1} (E_p[φ^(β)(I)] − E_f[φ^(β)(I)]).   (A.2)
Equation A.1 follows from a second-order Taylor expansion argument (corollary 4.4 of Kullback, 1959, p. 48), where p′ is a distribution whose expected feature statistics are between those of p and p+, and

Var_{p′}[Φ+(I)] = ( Var_{p′}[Φ(I)]            Cov_{p′}[Φ(I), φ^(β)(I)]
                    Cov_{p′}[φ^(β)(I), Φ(I)]  Var_{p′}[φ^(β)(I)] )
               = ( V11  V12
                    V21  V22 ).

Equation A.2 results from the fact that E_{p+}[Φ(I)] = E_p[Φ(I)], and by the Schur formula, it is well known that V_{p′} = V22 − V21 V11^{−1} V12.

Proof of Proposition 4. From the proof of Theorem 1, we know E_f[log p(I; Λ*, S)] = E_{p(I;Λ*,S)}[log p(I; Λ*, S)], and by a similar derivation we have E_f[log p(I; Λ, S)] = E_{p(I;Λ*,S)}[log p(I; Λ, S)] for any Λ. Then

KL(f, p(I; Λ, S)) = E_f[log f(I)] − E_f[log p(I; Λ, S)]
= E_f[log f(I)] − E_{p(I;Λ*,S)}[log p(I; Λ, S)]
= E_f[log f(I)] − E_f[log p(I; Λ*, S)] + E_{p(I;Λ*,S)}[log p(I; Λ*, S)] − E_{p(I;Λ*,S)}[log p(I; Λ, S)]
= KL(f, p(I; Λ*, S)) + KL(p(I; Λ*, S), p(I; Λ, S)).

The result follows by setting Λ = Λ̂.

Proof of Proposition 5. By Proposition 2 we have L(Λ̂, S) = −entropy(p(I; Λ̂, S)). By a similar derivation, we can prove that E_obs[L(Λ*, S)] = −entropy(p(I; Λ*, S)) and

L(Λ̂, S) − L(Λ*, S) = KL(p(I; Λ̂, S), p(I; Λ*, S)).
(A.3)
Applying E_obs to both sides of equation A.3, we have

−E_obs[entropy(p(I; Λ̂, S))] + entropy(p(I; Λ*, S)) = E_obs[KL(p(I; Λ̂, S), p(I; Λ*, S))],

and the result follows.

Proof of Equation 3.6.
To simplify the notation, we denote

Var_{p*+}[Φ+(I)] = ( Var_{p*+}[Φ(I)]            Cov_{p*+}[Φ(I), φ^(β)(I)]
                      Cov_{p*+}[φ^(β)(I), Φ(I)]  Var_{p*+}[φ^(β)(I)] )
                 = ( X11  X12
                      X21  X22 ),

A = ( I1              0
       −X21 X11^{−1}   I2 ),

and

B = ( Var_{p*+}[Φ(I)]   0
       0                 Var_{p*+}[φ⊥^(β)(I)] ),

where I1, I2 are identity matrices, and φ⊥^(β)(I) is uncorrelated with Φ(I) under p*+. So we have A Φ+(I) = (Φ(I), φ⊥^(β)(I))′ and A Var_{p*+}[Φ+(I)] A′ = B. Thus Var_{p*+}^{−1}[Φ+(I)] = A′ B^{−1} A, and since Var_{p*+}[Φ(I)] = Var_{p*}[Φ(I)], therefore

tr(Var_f[Φ+(I)] Var_{p*+}^{−1}[Φ+(I)]) = tr(Var_f[Φ+(I)] A′ B^{−1} A)
= tr((A Var_f[Φ+(I)] A′) B^{−1})
= tr(Var_f[Φ(I)] Var_{p*}^{−1}[Φ(I)]) + tr(Var_f[φ⊥^(β)(I)] Var_{p*+}^{−1}[φ⊥^(β)(I)]),
and equation 3.6 follows.

Acknowledgments

We are very grateful to two anonymous referees, whose insightful comments greatly improved the presentation of this article. This work is supported by NSF grant DMS-91-21266 to David Mumford. The second author is supported by a grant to Donald B. Rubin.

References

Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of statistics (pp. 27–42). Amsterdam: North-Holland.
Barlow, H. B., Kaushal, T. P., & Mitchison, G. J. (1989). Finding minimum entropy codes. Neural Computation, 1, 412–423.
Bergen, J. R., & Adelson, E. H. (1991). Theories of visual texture perception. In D. Regan (Ed.), Spatial vision. Boca Raton, FL: CRC Press.
Besag, J. (1973). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Royal Stat. Soc., B, 36, 192–236.
Brown, L. D. (1986). Fundamentals of statistical exponential families: With applications in statistical decision theory. Hayward, CA: Institute of Mathematical Statistics.
Chubb, C., & Landy, M. S. (1991). Orthogonal distribution analysis: A new approach to the study of texture perception. In M. S. Landy & J. A. Movshon (Eds.), Computational models of visual processing. Cambridge, MA: MIT Press.
Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy based algorithms for best basis selection. IEEE Trans. on Information Theory, 38, 713–718.
Cross, G. R., & Jain, A. K. (1983). Markov random field texture models. IEEE Trans. PAMI, 5, 25–39.
Daugman, J. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. Journal of Optical Soc. Amer., 2(7).
Dayan, P., Hinton, G. E., Neal, R. N., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–905.
Donoho, D. L., & Johnstone, I. M. (1994). Ideal de-noising in an orthonormal basis chosen from a library of bases. Acad. Sci. Paris, Ser. I, 319, 1317–1322.
Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. PAMI, 6, 721–741.
Heeger, D. J., & Bergen, J. R. (1995). Pyramid-based texture analysis/synthesis. In Computer Graphics Proceedings (pp. 229–238).
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620–630.
Jolliffe, I. T. (1986). Principal components analysis. New York: Springer-Verlag.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Julesz, B. (1962). Visual pattern discrimination. IRE Transactions of Information Theory, IT-8, 84–92.
Julesz, B. (1995). Dialogues on perception. Cambridge, MA: MIT Press.
Karni, A., & Sagi, D. (1991). Where practice makes perfect in texture discrimination: Evidence for primary visual cortex plasticity. Proc. Nat. Acad. Sci. U.S.A., 88, 4966–4970.
Kullback, S. (1959). Information theory and statistics. New York: Wiley.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. Annual Math. Stat., 22, 79–86.
Mallat, S. (1989). Multi-resolution approximations and wavelet orthonormal bases of L2(R). Trans. Amer. Math. Soc., 315, 69–87.
Mumford, D. B. (1992). On the computational architecture of the neocortex II: The role of cortico-cortical loops. Biological Cybernetics, 66, 241–251.
Mumford, D. B., & Shah, J. (1989). Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math., 42, 577–684.
Picard, R. W. (1996). A society of models for video and image libraries (Technical Rep. No. 360). Cambridge, MA: MIT Media Lab.
Priestley, M. B. (1981). Spectral analysis and time series. San Diego: Academic Press.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Sherman, M. (1996). Variance estimation for statistics computed from spatial lattice data. J. R. Statistics Soc., B, 58, 509–523.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27.
Silverman, M. S., Grosof, D. H., De Valois, R. L., & Elfar, S. D. (1989). Spatial-frequency organization in primate striate cortex. Proc. Natl. Acad. Sci. U.S.A., 86.
Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multi-scale transforms. IEEE Trans. on Information Theory, 38, 587–607.
Watson, A. (1987). Efficiency of a model human image code. J. Opt. Soc. Am. A, 4(12), 2401–2417.
Winkler, G. (1995). Image analysis, random fields and dynamic Monte Carlo methods. Berlin: Springer-Verlag.
Witkin, A., & Kass, M. (1991). Reaction-diffusion textures. Computer Graphics, 25, 299–308.
Xu, L. (1995). Ying-Yang machine: A Bayesian-Kullback scheme for unified learnings and new results on vector quantization. Proc. Int'l Conf. on Neural Info. Proc. Hong Kong.
Younes, L. (1988). Estimation and annealing for Gibbsian fields. Annales de l'Institut Henri Poincaré, Section B, Calcul des Probabilités et Statistique, 24, 269–294.
Zhu, S. C. (1996). Statistical and computational theories for image segmentation, texture modeling and object recognition. Unpublished Ph.D. dissertation, Harvard University.
Zhu, S. C., & Mumford, D. B. (1997). Learning generic prior models for visual computation. In Proc. of Int'l Conf. on Computer Vision and Pattern Recognition. Puerto Rico.
Zhu, S. C., Wu, Y. N., & Mumford, D. B. (1996). FRAME: Filters, random fields and maximum entropy, toward a unified theory for texture modeling. In Proc. of Int'l Conf. on Computer Vision and Pattern Recognition. San Francisco.

Received April 10, 1996; accepted March 27, 1997. Address correspondence to [email protected].
NOTES
Communicated by Shun-ichi Amari
A Local Learning Rule That Enables Information Maximization for Arbitrary Input Distributions

Ralph Linsker
IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY 10598, U.S.A.
This note presents a local learning rule that enables a network to maximize the mutual information between input and output vectors. The network's output units may be nonlinear, and the distribution of input vectors is arbitrary. The local algorithm also serves to compute the inverse C^{−1} of an arbitrary square connection weight matrix.

1 Introduction

Unsupervised learning algorithms that maximize the information transmitted through a network (the infomax principle; Linsker, 1988) typically require the computation of a function that is, at first appearance, highly nonlocal. This occurs because the units perform gradient ascent in a quantity that is essentially the joint entropy of the output values. If the input distribution is multivariate gaussian and the network mapping is linear, then (Linsker, 1992) the output entropy is proportional to ln det Q, where Q is the covariance matrix of the output values. If the input distribution is arbitrary and the network mapping is nonlinear but constrained to be invertible (Bell & Sejnowski, 1995), then the output entropy contains the term ln |det C|, where C is the (assumed square) matrix of feedforward connection strengths. In the latter case, the inverse of the connection strength matrix, (C^T)^{−1}, must be computed. Unless a local infomax learning rule exists, one that makes use only of quantities available at the connection being updated, it is implausible that a biological system could implement this optimization principle.

Linsker (1992) showed how to perform gradient ascent in ln det Q using a local algorithm. That result was embedded within the context of a linear network operating on gaussian-distributed input with additive noise. However, the essential features of that local algorithm can be applied to a broader class of information-maximizing networks, as shown here.

This note uses a familiar network architecture, comprising feedforward and recurrent lateral connections, in which the lateral connections change according to an anti-Hebbian rule. This network computes the gradient of ln det Q and, in so doing, also computes the inverse (C^T)^{−1} of an arbitrary square matrix of feedforward weights. The algorithm is entirely local; the
computation of each required quantity for each connection uses only information available at that connection. By using this local algorithm in conjunction with the method of Bell and Sejnowski (1995), one obtains a local learning rule for information maximization that applies to arbitrary input distributions, provided the network mapping is constrained to be invertible.

In an alternative method for entropy-gradient learning (Amari, Cichocki, & Yang, 1996), the need to compute the inverse of the connection weight matrix is avoided by defining the weight change to be proportional to the entropy gradient multiplied by C^T C. In that case, lateral connections are absent, but both a feedforward and a feedback pass through the connections C are required.

2 The Network

Consider a network with input vector x and feedforward connection matrix C that computes u ≡ Cx. Each component of u may be passed through a nonlinear function such as a sigmoid, but the nature of this "squashing" function is irrelevant here. See Bell and Sejnowski (1995) for a discussion of the role of this nonlinearity in information maximization.

Lateral connections, between each pair of output units, are described by a matrix F. They are used recursively to compute the auxiliary vector v(t) = Cx + Fv(t − 1) at time steps t = 1, 2, . . . , where v(0) = Cx. For fixed x, v(t) converges to v = Cx + Fv, that is,

v = (I − F)^{−1} Cx,
(2.1)
provided all eigenvalues of F have absolute value < 1.

3 Local Learning Rule

We use this network to compute (C^T)^{−1} for a square matrix C as follows. Define q = ⟨xx^T⟩ and Q = ⟨uu^T⟩ = CqC^T, where ⟨.⟩ denotes an ensemble average and superscript T denotes transpose. Assume that the ensemble of input vectors x spans the input space; then q is a positive definite matrix. (If x is defined so that ⟨x⟩ = 0, then q and Q are the covariance matrices of the input and output vectors, respectively.)

Suppose for the moment that the lateral weights F can be made to satisfy F = I − αQ (for given C). (Footnote 1: α is a positive number chosen to ensure that the eigenvalue condition on F (above) is satisfied. In order to ensure convergence of the recursively computed vector v, it is necessary and sufficient to choose 0 < α < 2/Q+, where Q+ is the largest eigenvalue of Q. See Linsker (1992) for more detail, including how to choose α to optimize the convergence rate of v.) Then Q^{−1} = α(I − F)^{−1}, whence I = α(I − F)^{−1} CqC^T,
yielding

(C^T)^{−1} = α(I − F)^{−1} Cq = α⟨vx^T⟩,
(3.1)
where equation 2.1 has been used at the last step. This is a local network computation, since the component of the matrix inverse at connection i → n, namely, (C^{−1})_{in} = ⟨v_n x_i⟩, is computed using only quantities available at the two ends of that connection.

The lateral weights F achieve the required values by means of either of the following local algorithms. (1) For given C, one presents a set of input vectors x, computes the corresponding u vectors, and sets F = I − α⟨uu^T⟩ (= I − αQ). However, since the C values will themselves be changing during a training process, it is computationally inefficient to hold C fixed while presenting an ensemble of input vectors at each C. (2) It is more practical to let C evolve slowly and incrementally update F according to ΔF = β(−αuu^T + I − F), where β is a learning rate parameter for the lateral connections. This is an anti-Hebbian learning rule. It is a local computation, since the change at each lateral connection m → n, namely, ΔF_{nm} = β(−αu_n u_m + δ_{nm} − F_{nm}), depends only on information available at that connection.

The network operation is summarized as follows: Vector x is presented and held at the inputs. The linear output u_n = Σ_i C_{ni} x_i is stored at each output unit n. For an incremental learning rule, each pair of outputs u_n and u_m is used to compute the change ΔF_{nm} in the weight of the lateral connection m → n. The outputs v_n(t) are iteratively computed until asymptotic values v_n are obtained. For each feedforward connection i → n, the input x_i and output v_n are used to compute the component [(C^T)^{−1}]_{ni} for that connection.

One combines this calculation with the method of Bell and Sejnowski (1995) to yield a completely local learning rule for arbitrary input distributions (assuming invertible C) as follows: The already-stored value of u_n and the input x_i are used to compute the remaining term h(u_n) x_i in Bell and Sejnowski's feedforward weight update expression, which has the form ΔC_{ni} ∝ [(C^T)^{−1}]_{ni} + h(u_n) x_i, where h is a nonlinear function of a unit's output value.

Following the application of the learning rule for C, whether that of Bell and Sejnowski (1995) or some other rule requiring the computation of the feedforward matrix inverse, a new input vector x is presented, and the previous steps are repeated until C converges.

Further comments relating the present method to that of Linsker (1992):

1. Relation between (C^T)^{−1} and the gradient ∂(ln det Q)/∂C: We have

(1/2) ∂(ln det Q)/∂C = Q^{−1} Cq = (CqC^T)^{−1} Cq = (C^T)^{−1}.   (3.2)

A derivation of the first equality is given in Linsker (1992); it requires that Q be positive definite, which is true for nonsingular C.
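For concreteness, the following sketch runs the complete loop described above on random nongaussian inputs. The learning rates, the inner-loop length, and the use of h(u) = −2 tanh(u) (the Bell-Sejnowski term for a tanh squashing function) are illustrative choices, not prescriptions from this note:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
C = np.eye(N) + 0.1 * rng.standard_normal((N, N))   # feedforward weights
F = np.zeros((N, N))                                # lateral weights
alpha, beta, eta = 0.1, 0.5, 0.01

for step in range(2000):
    x = rng.laplace(size=N)                    # nongaussian input vector
    u = C @ x                                  # linear outputs
    # anti-Hebbian update driving F toward I - alpha*Q
    F += beta * (-alpha * np.outer(u, u) + np.eye(N) - F)
    # recursive lateral loop: v(t) = Cx + F v(t-1) -> (I - F)^{-1} C x
    v = u.copy()
    for _ in range(50):
        v = u + F @ v
    inv_CT = alpha * np.outer(v, x)            # one-sample estimate of (C^T)^{-1}
    # feedforward update of the Bell-Sejnowski form with h(u) = -2 tanh(u)
    C += eta * (inv_CT + np.outer(-2.0 * np.tanh(u), x))
```

Every quantity used to change a given weight is available at that weight's two ends, which is the point of the construction.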
2. Relation between the gradients of ln det Q and of ln |det C|: We have ln det Q = ln det(CqC^T) = 2 ln |det C| + ln det q. The last term is fixed by the input distribution. Therefore, ∂(ln det Q)/∂C = 2 ∂(ln |det C|)/∂C.

3. For the noiseless case treated here, the value of ∂(ln det Q)/∂C is independent of the input statistics (i.e., the matrix q). In an information-maximizing process in which the function being maximized is ln det(CqC^T) + S, where S may incorporate other terms (e.g., related to a nonlinear "squashing" function or a linear gain-control term, discussed later), the input statistics enter solely through their effect on S. This statement is not true when noise is present, as in Linsker (1992); then Q = CqC^T + (noise variance term), and q does not vanish from the gradient term ∂(ln det Q)/∂C.

4 Comparison of Two Instantiations of the Infomax Principle

The principle of maximum information preservation (infomax) states that an input-output mapping should be chosen from a set of admissible mappings so as to maximize the amount of Shannon information that the outputs jointly convey about the inputs (Linsker, 1988). The principle has been applied to linear maps with a gaussian input distribution (Linsker, 1989a, 1992; Atick, 1992), to nonlinear maps with arbitrary input distributions in which a single output node "fires" (Linsker, 1989b), and to nonlinear invertible maps with arbitrary input distributions and an equal number of inputs and outputs (Bell & Sejnowski, 1995).

We have shown that the essential features of the local learning rule used in Linsker (1992) also provide a local rule for the Bell and Sejnowski (1995) model. What are the essential similarities and differences between the operation of the infomax principle in these two models?
The nonlinear function provides the important benefit of being responsive to higher-order (nongaussian) statistics. Two points emerge from this comparison. First, the quantities being optimized in the two models are quite similar, although noise and nonlinearity each contributes distinctive and important terms to the learning rules. Second, it is the limitation of the output units’ dynamic range, whether by linear gain control or a nonlinear function, that prevents the weights and the information capacity from increasing without bound. Neither the presence of noise in the Linsker (1992) model nor the fact that the squashing function is nonlinear in Bell and Sejnowski (1995) is needed for the output entropy maximization process to be well defined. Understanding how different model elements affect infomax learning will be of value as information-maximizing principles are applied to more complex and biologically realistic networks. References Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21 March, 105–117. Linsker, R. (1989a). An application of the principle of maximum information preservation to linear systems. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 186–194). San Mateo, CA: Morgan Kaufmann. Linsker, R. (1989b). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comp., 1, 402–411. Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comp., 4, 691–702. Received October 14, 1996; accepted May 12, 1997.
Communicated by Helge Ritter
Convergence and Ordering of Kohonen's Batch Map

Yizong Cheng
Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, U.S.A.
The convergence and ordering of Kohonen's batch-mode self-organizing map with Heskes and Kappen's (1993) winner selection are proved. Selim and Ismail's (1984) objective function for k-means clustering is generalized in the convergence proof of the self-organizing map. It is shown that when the neighborhood relation is doubly decreasing, order in the map is preserved. An unordered map becomes ordered when a degenerate state of ordering is entered, where the number of distinct winners is one or two. One strategy for entering this state is to run the algorithm with a broad neighborhood relation.

1 Introduction

Kohonen's self-organizing map adds a neighborhood relation to the cluster centers in competitive learning, so the data space is mapped onto a regularly arranged grid of cells. The mapping may distort the data space, but much of the local order is preserved.

The original formulation of the map consists of random drawing of examples from the data sample and incremental adjustments to the map parameters after each example is shown. When this process is modeled as a stochastic approximation, the step size α_i for parameter adjustment is said to have to satisfy the conditions Σ_i α_i = ∞ and Σ_i α_i² < ∞. However, because the map is useful as an algorithm running on a computer, and algorithms must terminate in a finite number of iterations, these conditions are difficult to observe in practice.

When the sample is finite or changes slowly, the average move of the parameter adjustment has been used to study the behavior of the self-organizing map. Such a study has been done by Erwin, Obermayer, and Schulten (1992), who conclude that "no strict proof of convergence is likely to be possible." They also show that ordering is a consequence of the occurrence of a specific permutation of the sample in the drawing sequence. During random drawing, all permutations of the sample will eventually occur, and thus ordering is achieved with probability one. However, the average time for a particular permutation to occur from this random drawing is exponential, and no execution of the algorithm can support such a running time.
Kohonen (1995) eventually proposed the batch version of his self-organizing map, using the mean shift instead of the average move of the batch as the adjustment of the parameters. Some variations and generalizations of the batch map have been proposed, for instance, by Mulier and Cherkassky (1995). Heskes and Kappen (1993) suggest a less straightforward winner-take-all step so that an energy function can be proposed for the process.

In this article, two modifications are made to Kohonen's batch map: (1) an adoption of the winner-take-all step suggested by Heskes and Kappen and (2) the abolition of neighborhood narrowing before the termination of the algorithm. Instead, the algorithm is run several times, with different but fixed neighborhood functions. Some of the results presented here also apply to the original Kohonen batch map.

In section 2, an energy function similar to the objective function in Selim and Ismail's (1984) treatment of k-means clustering is proposed. The algorithm is treated as alternating partial minimization of the energy function, and this allows an easy proof of its convergence. The general definition of the algorithm invites variations, although only a specific version is fully discussed here. In section 3, a condition called doubly decreasing is defined for the neighborhood function, and it is shown to be a sufficient condition for one-dimensional order preservation by the batch map. Section 4 discusses initialization of ordering with an unfocused map and refinement of the map by narrowing the neighborhood function on successive runs of the algorithm.

2 The Algorithm

The batch map consists of two finite sets: S, the map, and D, the sample. The self-organizing process is an evolution of the assignment of S members to D members. The goal of k-means clustering, which is a special case of the batch map, is to assign the same S member to nearby D members. The goal of the self-organizing map is to assign nearby S members to nearby clusters of D.

The nearness of the members of D or S is represented by considering them as subsets of some Euclidean spaces. To measure the nearness of S members, a nonnegative neighborhood function h: S × S → [0, ∞) is defined. A distance measure v: X × X → [0, ∞) is also defined over the data space X that includes D. D, S, h, and v are given and fixed, and the parameters that evolve through self-organization are a membership function u: S × D → [0, 1], with the constraint that

Σ_{s∈S} u(s, x) = 1 for all x ∈ D,   (2.1)

and a mapping z: S → X. The self-organizing algorithm is the alternating
partial minimization of the energy function

f(u, z) = Σ_{s,t∈S} Σ_{x∈D} u(s, x) h(s, t) v(z(t), x).   (2.2)
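The energy is easy to evaluate directly, which is useful for checking that each step of the algorithm below indeed decreases it. A minimal sketch (the array layout and the squared-distance choice of v are illustrative assumptions):

```python
import numpy as np

def energy(u, z, D, h):
    """f(u, z) = sum_{s,t in S} sum_{x in D} u(s,x) h(s,t) v(z(t),x),
    with v(xi, x) = ||xi - x||^2.
    u: (|S|, |D|) membership; z: (|S|, d) map; D: (|D|, d) data;
    h: (|S|, |S|) neighborhood matrix."""
    dist = ((z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)  # (|S|, |D|)
    return float(np.sum(u * (h @ dist)))
```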
Given the current z, a u that minimizes f (u, z) is found, and based on this u, a new z that minimizes f (u, z) is found. This cycle is continued until f (u, z) is no longer decreasing. Algorithm 1.
Repeat steps 1 and 2 until f (u, z) stops decreasing.
Step 1. For each x ∈ D, select an s ∈ S that minimizes c(s, x) = Σ_{t∈S} h(s, t) v(z(t), x), and make u(s, x) = 1 and w(x) = s.
Step 2. For each s ∈ S, let z(s) = ξ, where ξ minimizes

G(ξ) = Σ_{t∈S} h(t, s) Σ_{w(x)=t} v(ξ, x).   (2.3)
When v(ξ, x) = ‖ξ − x‖², this step is

z(s) = Σ_{x∈D} h(w(x), s) x / Σ_{x∈D} h(w(x), s).
(2.4)
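A minimal sketch of one iteration of algorithm 1 with v(ξ, x) = ‖ξ − x‖², using the same array layout as the energy sketch above:

```python
import numpy as np

def batch_map_step(z, D, h):
    """One pass of algorithm 1: Heskes-Kappen winner selection minimizing
    c(s, x) = sum_t h(s,t) v(z(t), x), then the weighted-mean update of
    equation 2.4."""
    dist = ((z[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)  # (|S|, |D|)
    w = np.argmin(h @ dist, axis=0)            # winner w(x) for each x
    hw = h[w, :]                               # (|D|, |S|): h(w(x), s)
    z_new = (hw.T @ D) / hw.sum(axis=0)[:, None]               # eq. 2.4
    return z_new, w
```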
There are two significant differences between algorithm 1 and the batch map proposed in Kohonen (1995). First, Kohonen's algorithm simply assigns the s ∈ S with z(s) nearest to x ∈ X as the "winner" w(x). Second, the neighborhood function h narrows during the execution of Kohonen's algorithm. The reason that Heskes and Kappen's winner-take-all step is used in algorithm 1 is the concern about convergence raised by Erwin et al. (1992). In section 4, it will be demonstrated that order can be initialized and refined using different neighborhood functions on different runs of algorithm 1; thus neighborhood narrowing is factored out of the algorithm itself, to make the convergence proof more straightforward.

Theorem 1.
Algorithm 1 terminates in finitely many steps.
Proof. First, we prove that step 1 is a partial minimization of the energy function f(u, z) with z fixed. The energy function is f(u, z) = Σ_{s∈S} Σ_{x∈D} u(s, x) c(s, x). Suppose s ∈ S minimizes c(s, x) for a particular x ∈ D, but u(s, x) < 1. Then there must be a t ≠ s such that u(t, x) > 0. Let u′ be the same as u except u′(s, x) = u(s, x) + u(t, x) and u′(t, x) = 0. Then f(u, z) − f(u′, z) = u(s, x)c(s, x) + u(t, x)c(t, x) − u′(s, x)c(s, x) = u(t, x)[c(t, x) − c(s, x)] ≥ 0. Hence u′ is a choice at least as good as u.

Once a u is chosen in step 1, the energy function becomes f(u, z) = Σ_{x∈D} Σ_{t∈S} h(w(x), t) v(z(t), x), and u is no longer a variable. We have to
determine z(t) for every t ∈ S so that f(u, z) is minimized. In f(u, z), the terms involving each z(t) can be separated and summed up to G(z(t)) = ∑_{x∈D} h(w(x), t) v(z(t), x). Hence for each t ∈ S, z(t) should be the ξ that minimizes G(ξ) = ∑_{x∈D} h(w(x), t) v(ξ, x), and this is the same as step 2.

During each iteration of algorithm 1, a winner assignment is chosen in step 1. This choice determines a z from step 2 and a lower value of f(u, z). Since there are no more than |S|^|D| different winner assignments, algorithm 1 terminates in finitely many steps.

3 Order Preservation

Algorithm 1 is the iteration of two partial minimization steps. The current z determines the next w, which in turn determines the next z. To show that once the map is ordered, the order is preserved, we must show that an ordered z leads to an ordered w, and an ordered w leads to an ordered z. Ordering is well defined only in one-dimensional cases. It is often assumed that the topology preservation of a multidimensional map on multidimensional data is a natural consequence of one-dimensional order preservation.

In the discussion of ordering, we assume that S is one-dimensional, with si < sj when i < j. We assume that D members are aligned in X, a Euclidean space. We assume that equation 2.4 is the adjustment formula for z in step 2 of algorithm 1. Because each z(si) is a weighted average of D members, the z(si) must be aligned too. We say that w is ordered if, in one of the two directions in the one-dimensional subspace spanned by D, x < y implies w(x) ≤ w(y). We say that z is ordered if i < j implies z(si) ≤ z(sj), in a direction in X. The proofs of the order preservation results in this section require a property of the neighborhood function, defined below.

Definition 1. A function h: [0, ∞) → [0, ∞) is said to be doubly decreasing if h(a) is decreasing and, for any a, b, c ≥ 0 with h(a + b + c) > 0,

h(a)/h(a + b) ≤ h(a + c)/h(a + b + c).    (3.1)
When h is differentiable, the condition in equation 3.1 is equivalent to the condition that both h(a) and d log h(a)/da are decreasing. With 0 < r < 1 and 0 < p, a group of exponential-decay-type functions is doubly decreasing:

h(a) = r^(a^p).    (3.2)
A special case of this group is the gaussian function,

h(a) = e^(−a²/σ²).    (3.3)
Also, the triangle function,

h(a) = 1 − a/σ for a < σ, and h(a) = 0 for a ≥ σ,    (3.4)

for σ > 0, is doubly decreasing.
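The condition in equation 3.1 is easy to probe numerically. The following sketch brute-forces it, in cross-multiplied form, on a grid of nonnegative a, b, c for the gaussian and triangle families; the grid resolution and tolerance are arbitrary choices.

import numpy as np

def is_doubly_decreasing(h, grid, tol=1e-12):
    """Check h(a) h(a+b+c) <= h(a+c) h(a+b), the cross-multiplied form of
    equation 3.1, over all a, b, c in the grid with h(a+b+c) > 0."""
    for a in grid:
        for b in grid:
            for c in grid:
                if h(a + b + c) > 0:
                    if h(a) * h(a + b + c) > h(a + c) * h(a + b) + tol:
                        return False
    return True

grid = np.linspace(0.0, 3.0, 13)
gauss = lambda a, sigma=1.0: np.exp(-a**2 / sigma**2)       # equation 3.3
triangle = lambda a, sigma=2.0: max(0.0, 1.0 - a / sigma)   # equation 3.4
print(is_doubly_decreasing(gauss, grid))     # True
print(is_doubly_decreasing(triangle, grid))  # True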
Lemma 1. Let si < sj when i < j. Let h be doubly decreasing. Then, for each q,

ri = h(|si − sq|) / h(|si − sq+1|)    (3.5)

is a decreasing sequence.

Proof. Let i < j, a = sj − si, and c = sq+1 − sq. We show that ri ≥ rj in three cases—i < j ≤ q, q + 1 ≤ i < j, and i = q and j = q + 1—and thus show that ri is decreasing. Case 1: Let b = sq − sj. Then ri ≥ rj is h(a + b)/h(a + b + c) ≥ h(b)/h(b + c). Case 2: Let b = si − sq+1, and ri ≥ rj is h(b + c)/h(b) ≥ h(a + b + c)/h(a + b). The inequalities in these two cases are rearrangements of the one in equation 3.1. Case 3: ri ≥ rj is h(0)/h(c) ≥ h(c)/h(0), which is true because h is decreasing.

Lemma 2. Let ai ≥ 0 and bi ≥ 0 for i = 1, . . . , n, and ∑_{i=1}^n ai = ∑_{i=1}^n bi. Let c1 ≤ c2 ≤ · · · ≤ cn. If there is a k such that ai ≥ bi when i ≤ k and ai ≤ bi when i > k, then ∑_{i=1}^n ai ci ≤ ∑_{i=1}^n bi ci.
Proof. We must have ∑_{i=1}^k (ai − bi) = ∑_{i=k+1}^n (bi − ai). Therefore,

∑_{i=1}^n bi ci − ∑_{i=1}^n ai ci = ∑_{i=k+1}^n (bi − ai) ci − ∑_{i=1}^k (ai − bi) ci
≥ ∑_{i=k+1}^n (bi − ai) ck − ∑_{i=1}^k (ai − bi) ck = 0.    (3.6)
Theorem 2. Let the neighborhood function h(s, t) = h(|s − t|) be doubly decreasing. If, prior to step 2 of algorithm 1, w is ordered, then after step 2, z is ordered.

Proof. Let D be x1 < x2 < · · · < xn and w(xi) ≤ w(xj) when i < j. Let z(sq) = ∑_{i=1}^n ai xi and z(sq+1) = ∑_{i=1}^n bi xi. When bi > 0, ai ≥ bi is equivalent to

ri = h(w(xi), sq) / h(w(xi), sq+1) ≥ ∑_{i=1}^n h(w(xi), sq) / ∑_{i=1}^n h(w(xi), sq+1) = K.    (3.7)
If we can show that ri ≥ rj when i < j, then there must be a k such that ri ≥ K when i ≤ k and ri ≤ K when i > k, which is the condition of lemma 2. Using lemma 2, we have z(sq) ≤ z(sq+1) for all q, and the theorem is proved. To show that ri is monotone decreasing, first we notice that because w is ordered, the problem is equivalent to showing that r*i = h(si, sq)/h(si, sq+1) is decreasing, which is a consequence of lemma 1.

Theorem 3. Let the neighborhood function h(s, t) = h(|s − t|) be doubly decreasing. If, prior to step 1 of algorithm 1, z is ordered, then after step 1, the minima mq of cq(x) = ∑_i h(sq, si)(x − z(si))² are ordered, in the sense that mp ≤ mq when p < q.

Proof. The minimum of cq(x) is mq = ∑_i h(sq, si) z(si) / ∑_i h(sq, si). Let mq = ∑_i ai z(si) and mq+1 = ∑_i bi z(si) with ∑_i ai = ∑_i bi = 1, and ai, bi ≥ 0. Then ai ≥ bi is equivalent to

ri = h(sq, si) / h(sq+1, si) ≥ ∑_i h(sq, si) / ∑_i h(sq+1, si) = K.    (3.8)
Lemma 1 says that ri is decreasing. This means that ai ≥ bi when i ≤ k and ai ≤ bi for i > k, for some k, and lemma 2 says that mq ≤ mq+1.

Theorems 2 and 3 show that algorithm 1 preserves order in the self-organizing map under a doubly decreasing neighborhood function, except for a gap between the ordered minima of the cp(x)'s and the ordered w. First, this gap would not be there if the original Kohonen batch map were the algorithm. In that case, Voronoi regions based on the distance in space X are used to determine the winner, and an ordered z naturally implies an ordered w. Second, under some mild additional assumptions, the gap can be closed. One of these assumptions is that ∑_{t∈S} h(s, t) be independent of s. In this case, each pair of cp(x)'s has exactly one intersection, and it is always the case that when x is larger than the intersection, the cp(x) with the larger mp has the lower value and may win the competition. This assumption is satisfied when the map is periodic, for example, as a ring of evenly spaced points.

4 Order Initialization and Fine Tuning

The preceding section shows that algorithm 1 preserves existing order. The main effect of the algorithm is the improvement of the distribution of the map images. Using different neighborhood functions, an ordered map can be made more uniformly distributed or more focused. A more critical problem is how the order is initialized in the first place.

The strategies discussed in this section rely on two degenerate cases of order, in which the set {w(x); x ∈ D} has only one or two elements. Some of
Figure 1: cp(x) parabolas with z(si)'s as labeled boxes and D points as circles at their lowest cp(x) values. The upper row of labeled boxes indicates the positions of the initial, unordered z(si)'s, with i as the labels in the boxes. The lower row of boxes indicates the ordered z(si)'s after just one iteration of algorithm 1. This next z is also the final configuration when algorithm 1 terminates.
the strategies are immediately applicable to the original Kohonen batch map. One strategy is to make all initial z(s)'s less than all x ∈ D. In this case, there is one winner for all D members, and z is considered ordered. Another strategy is to place the initial z(s)'s between two neighboring x ∈ D, so that there are only two winners, which makes w ordered.

The situation becomes more complex when algorithm 1 is used. The two strategies in the preceding paragraph still work to a certain extent. But a third strategy is more general and explains how an unordered map turns into an ordered one. This strategy is to run algorithm 1 with a broad neighborhood function. This basically means choosing a large σ in the gaussian function (see equation 3.3) or the triangle function (see equation 3.4). Indeed, a σ can be found such that the difference between all applicable h values is as small as desired. In this case, all of the parabolas cp(x) = c(sp, x) have their symmetry centers as close to each other as desired. Those with smaller ∑_i h(sp, si) will grow more slowly and have lower minima. When the neighborhood function is broad enough, in the interval spanned by D, only one or two of these parabolas will be the lowest ones, and only one or two winners will be generated.
Figure 2: Algorithm 1 on a one-dimensional ring over a randomly generated D in R², using the gaussian neighborhood function of equation 3.3. (a) X is the unit square. D is 20 randomly generated points, denoted by squares. S is a ring of 30 points. The distance between two S points is the minimum number of hops one must cross on the ring from one to the other. Initially, the z(s) were randomly generated and are indicated with small circles connected into a ring, reflecting the ring topology of S. (b) Algorithm 1 was applied with σ = 8 in equation 3.3. It took six iterations to reach this fixed point. (c) Then algorithm 1 was applied to the resulting map from (b), with a narrower σ = 1. Two iterations were needed before this fixed point was reached. (d) Finally, σ = 0.1 was used, and two more iterations led to this configuration.
These winners often are the extreme members of S, because their corresponding ∑_i h(sp, si) are the smallest. When this happens, w is ordered. Figure 1 shows an example when z is unordered but w becomes
ordered, because only two of these parabolas have the lowest value in the D range. There is a significant difference between the order initialization in one iteration of algorithm 1 and the one proved in Erwin et al. (1992), which takes exponential time to accomplish.

To make the map more focused and more uniformly distributed, a few successive runs of algorithm 1 with narrower neighborhood functions are enough. This is demonstrated using Figure 2. Figure 2 shows the order initialization and fine tuning with a periodic one-dimensional S of 30 members and a two-dimensional D with 20 members. Starting with a randomly generated z (circles connected with line segments indicating their neighboring relationship in Figure 2a), a broad gaussian neighborhood function is used to initialize the order. The result is a miniature copy of the S structure at the centroid of D (see Figure 2b). Then two runs of the algorithm with narrower neighborhood functions fine-tune the map so it is focused. The total number of iterations in all three runs of algorithm 1 is 10.

5 Concluding Remarks

The convergence and ordering of Kohonen's batch map with Heskes and Kappen's winner selection are discussed and proved under some additional conditions, including the doubly decreasing property of the neighborhood function. The ordering results are applicable to the original Kohonen batch map. The general form of algorithm 1 also invites speculation on other variations of the batch map. The constraint on the membership function u can be replaced, and fuzzy winners or multiple winners will emerge. When v(x, y) = ‖x − y‖, a median-shift instead of mean-shift batch map is created, and a more robust and more uniformly distributed map may be obtained. (A separate article will deal with the case when the data space is a sphere.)

Ordering is only the one-dimensional case of topology preservation and does not account for everything that makes a self-organizing map attractive and useful. Some measures designed for multidimensional topology preservation have been proposed, and topology preservation of algorithm 1 under measures other than one-dimensional order has yet to be studied.

References

Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.
Heskes, T. M., & Kappen, B. (1993). Error potentials for self-organization. In Proceedings of ICNN'93 (pp. 1219–1233). San Francisco.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7, 1165–1177.
Selim, S. Z., & Ismail, M. A. (1984). K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Analysis and Machine Intelligence, 6, 81–86.

Received August 26, 1996; accepted March 14, 1997.
LETTERS
Communicated by Jack Cowan
Solitary Waves of Integrate-and-Fire Neural Fields David Horn Irit Opher School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
Arrays of interacting identical neurons can develop coherent firing patterns, such as moving stripes that have been suggested as possible explanations of hallucinatory phenomena. Other known formations include rotating spirals and expanding concentric rings. We obtain all of them using a novel two-variable description of integrate-and-fire neurons that allows for a continuum formulation of neural fields. One of these variables distinguishes between the two different states of refractoriness and depolarization and acquires topological meaning when it is turned into a field. Hence, it leads to a topologic characterization of the ensuing solitary waves, or excitons. They are limited to pointlike excitations on a line and linear excitations, including all the examples noted above, on a two-dimensional surface. A moving patch of firing activity is not an allowed solitary wave on our neural surface. Only the presence of strong inhomogeneity that destroys the neural field continuity allows for the appearance of patchy incoherent firing patterns driven by excitatory interactions.

1 Introduction

Can one construct a self-consistent description of neuronal tissue on the mm scale? If so, can one use for this description the same variables that characterize a single neuron? The answers to these questions are not obvious. Assuming they are affirmative, we are led to a theory of neural fields, describing the characteristic features of neurons at a given location and specific time. This necessitates continuity of neural variables, which may be quite natural in view of the large overlap between neighboring neurons and of the functional maps observed in different cortical areas, which show that nearby stimuli affect neighboring neurons.

A model of neuronal tissue composed of two aggregates (excitatory and inhibitory) was proposed long ago by Wilson and Cowan (1973). They described three different dynamical modes that correspond to different forms of connectivity. One of their interesting observations is the formation of pairs of unattenuated traveling waves that move in opposite directions as a result of a localized input. The propagation velocity depends on the interactions and the amount of disinhibition in the network. Building on this approach,
Neural Computation 9, 1677–1690 (1997)
© 1997 Massachusetts Institute of Technology
Amari (1977) analyzed pattern formation in continuum distributions of neurons, or neural fields. He pointed out the possibility of local excitation solutions and derived the conditions for their formation in one dimension. Ermentrout and Cowan (1979) studied layers of excitatory and inhibitory neurons interacting with one another in two dimensions and obtained the formations of moving stripes, or rolls, hexagonal lattice patterns, and other forms. They pointed out that if their model is applied to V1, it can provide an explanation of drug-induced visual hallucinations (Klüver, 1967), relying on the retinocortical map interpretation of V1. A simpler derivation of this result is provided by Cowan (1985), who considered a single type of neural field on a two-dimensional manifold using a difference-of-gaussians (DOG) (or "Mexican hat") interaction with close-by strong excitation surrounded by weak inhibition. Similar questions were recently investigated by Fohlmeister, Gerstner, Ritz, and van Hemmen (1995), who have structured their neural layer according to the spike response model of Gerstner and van Hemmen (1992). Their simulations exhibit stripe formations, rotating spirals, expanding concentric rings, and collective bursts.

The simulations of Fohlmeister et al. (1995) are an example of how the advent of large-scale computations allows one to attack elaborate neuronal models in the simulation of cortical tissues. This point was made by Hill and Villa (1994), who have studied two-dimensional arrays of 10,000 neurons and obtained firing patterns in the form of clusters or patches. Moving patches were also observed by Usher, Stemmler, Koch, and Olami (1994), who investigated a plane of integrate-and-fire neurons with DOG interactions.

This brings us to an interesting question. To what extent can one expect calculations on a grid of 100 × 100 neurons to reflect the behavior of a neural tissue that contains several orders of magnitude more neurons? Clearly, if the two questions raised at the beginning of this article are answered in the affirmative, there is a good chance of obtaining meaningful results. In this article, we will characterize allowed excitation formations that obey continuity requirements and show what happens when continuity is broken.

Some of the structures found in the different investigations move in a fashion that conserves their general form unless they hit and merge with or destroy one another. This is reminiscent of the particle property of solitons, which are known to arise in nonlinear systems (see, e.g., Newell, 1985). Yet some differences exist. In particular, the coherent firing structures annihilate during collision rather than staying intact. Therefore, an appropriate characterization is that of solitary waves (Meron, 1992). We propose using the term excitons to describe moving solitary waves in an excitable medium.

Our study is geared toward investigating these structures. We will introduce a topological description that will help us identify the type of coherent structures that are allowed under our assumptions. In particular, we will see that an object with the topological characteristics of a moving circle (patch of firing activity) is not an allowed solitary wave on a two-dimensional manifold.
2 Integrate-and-Fire Neurons

Integrate-and-fire (I & F) neurons have been chosen for simulating large systems of interacting neurons by many authors (e.g., Usher et al., 1994). For our purpose we need a formulation of this system in which the basic variables are continuous and differentiable. We use two such variables: v, which is a subthreshold potential, and m, which distinguishes between two different modes in the dynamics of the single neuron, the active depolarization mode and the inactive refractory period:

v̇ = −kv + α + cmv + mI    (2.1)
ṁ = −m + Θ(m − v).    (2.2)
Θ(x) is the Heaviside step function. The neuron is influenced by a constant external input I, which is absent in the absolute refractory period, when m = 0. Starting out with m = 1, the total time derivative of v is positive, and v follows the dynamics of a charging capacitor. Hence, this represents the depolarization period of v. During all this time, since v < m, m stays unchanged. The dynamics change when v reaches the threshold, which is arbitrarily set to 1. Then m decreases rapidly to zero, causing the time derivative of v to be negative, and v follows the dynamics of a discharging capacitor. Parameters are chosen such that the time constants of the charging and discharging periods are different. To complete this description of an I & F neuron, we need a quantity that represents the firing of the neuron. We introduce for this purpose

f = m(1 − m)v,    (2.3)

which vanishes at almost all times except when v arrives at the threshold and m changes from 1 to 0.¹ This can serve therefore as a description of the action potential. An example of the dynamics of v and m is shown in Figure 1. In a third frame we plot v + af, with a = 8, representing the total soma potential. The value of a is of no consequence in our work. It is used here for illustration purposes only.

Our description is different from the FitzHugh-Nagumo model (FitzHugh, 1961), which is also a two-variable system. Their variables correspond to a two-dimensional projection of the Hodgkin-Huxley equations, and their nonlinearity is of third power rather than a step function. We believe that for the purpose of a description that leads to an intuitive insight, it is important to use variables whose meaning is clear, as long as they can lead to a coarse reconstruction of the neurons' behavior.

¹ f gets a small contribution also when m changes from 0 to 1, but since it is several orders of magnitude smaller, it does not have any computational consequences. An alternative choice of f can be proportional to −(dm/dt) θ(−dm/dt).
Figure 1: The dynamics of the single I & F neuron. The upper frame displays v, the subthreshold membrane potential, as a function of time. The second frame shows m, the variable that distinguishes between the depolarization state, m = 1, and refractoriness, m = 0. In the third frame we plot v + 8 f , where f is our spike profile, to give a schematic presentation of the total cell potential. Parameters for this figure, as well as all one-dimensional simulations, are: k = 0.015, α = −0.02, c = 0.0135, and I = 0.05.
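A minimal forward-Euler rendering of equations 2.1 through 2.3, using the parameter values quoted in the caption of Figure 1, could read as follows; the step size dt and the duration are assumptions, since the discretization is not specified in the text.

import numpy as np

def simulate_neuron(T=250.0, dt=0.05, k=0.015, alpha=-0.02, c=0.0135, I=0.05):
    """Forward-Euler integration of the two-variable I & F unit:
    v' = -k v + alpha + c m v + m I,   m' = -m + Theta(m - v),
    with the firing profile f = m (1 - m) v (equations 2.1-2.3)."""
    n = int(T / dt)
    v = np.zeros(n)
    m = np.zeros(n)
    f = np.zeros(n)
    m[0] = 1.0                                   # start in depolarization mode
    for t in range(n - 1):
        theta = 1.0 if m[t] - v[t] >= 0 else 0.0          # Heaviside step
        v[t + 1] = v[t] + dt * (-k * v[t] + alpha + c * m[t] * v[t] + m[t] * I)
        m[t + 1] = m[t] + dt * (-m[t] + theta)
        f[t + 1] = m[t + 1] * (1 - m[t + 1]) * v[t + 1]   # spikes near m = 1/2
    return v, m, f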
3 I & F Neurons on a Line

At this point we introduce interactions between the neurons and see what kind of activity patterns emerge. As we are dealing with pulse coupling, an interaction is evoked by spiking of other neurons, with or without delays. Thus, equation 2.1 is replaced by

v̇i = −kvi + α + cmivi + mi(I + ∑_j Wij fj),    (3.1)
where i = 1, . . . , N represents the location of the neuron. Clearly the behavior of such a system of interacting neurons is strongly dependent on the properties of the interaction matrix W. There are indications in V1 that excitation occurs between neurons that lie close by (Marlin, Douglas, & Cynader, 1991) and that inhibition is the rule further out (Wörgötter, Niebur, & Koch, 1991), leading to a type of center-surround pattern of interactions. To be even closer to biological reality, we should introduce different descriptions for excitatory (pyramidal) neurons and inhibitory interneurons. For simplicity of presentation, we consider only one set of neurons undergoing both types of interactions. We may think of the neurons as pyramidal cells, with inhibition being an effective interaction.
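Extending the single-unit sketch above to the coupled system of equation 3.1 mainly adds the interaction term ∑_j Wij fj. In the sketch below, the center-surround matrix W is an illustrative stand-in built from two gaussians with invented amplitudes and widths; the text does not give the one-dimensional coupling parameters.

import numpy as np

def simulate_line(N=100, T=250.0, dt=0.05, k=0.015, alpha=-0.02,
                  c=0.0135, I=0.05, seed=0):
    """Equation 3.1 on a line of N units:
    v_i' = -k v_i + alpha + c m_i v_i + m_i (I + sum_j W_ij f_j)."""
    rng = np.random.default_rng(seed)
    idx = np.arange(N)
    d = np.abs(idx[:, None] - idx[None, :])
    # Center-surround coupling: short-range excitation, broader inhibition.
    W = 0.4 * np.exp(-d**2 / 5.0) - 0.1 * np.exp(-d**2 / 40.0)
    v = rng.uniform(0.0, 1.0, N)                # random initial conditions
    m = np.ones(N)                              # all units start depolarizing
    f = np.zeros(N)
    for _ in range(int(T / dt)):
        theta = (m - v >= 0).astype(float)      # elementwise Heaviside
        v = v + dt * (-k * v + alpha + c * m * v + m * (I + W @ f))
        m = m + dt * (-m + theta)
        f = m * (1 - m) * v
    return v, m, f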
Figure 2: Creation, annihilation, and motion of excitons in the x − t plane. For fixed x, the excitons (black lines, left frame) appear on the border between the m = 1 (black) and m = 0 (white) areas shown in the right frame.
In Figure 2 we show a spatiotemporal pattern that emerges from such a system. We assume here that the neurons are located on a line with open boundary conditions. The interaction is assumed to be simultaneous and to have a finite width on this line. In all our simulations, we start with random initial conditions for vi while all mi = 1. I has the same constant value for all neurons. After some transitional period, the system settles into a spatiotemporal pattern that is a limit cycle.

As the interaction strength is increased, the behavior of vi and mi becomes a smooth function of the location i of the neuron. When this happens, we may replace vi and mi by continuous fields v(x, t) and m(x, t). Our problem may then be redefined by a set of equations for continuous variables. For example, equation 3.1 becomes

dv/dt = −kv + α + cmv + m(I + ∫ w(x, y) f(y) dy),    (3.2)

and equations 2.2 and 2.3 remain valid for the continuous fields. Once we are in the continuum regime, the number of neuronal units that we use in our simulations is immaterial. They represent a discretized approximation to a continuum system. We then have a global description of a neural tissue in terms of two continuous variables. f(x, t) represents a volley of action potentials originated at time t by somas of neurons located at x. This is usually regarded as the interesting object to look at. As we will see, it forms one of the borderlines of regions of m in (x, t) space-time. We propose looking at m(x, t) in order to understand the emergent behavior of this system.

Looking again at Figure 2, we see characteristic solitary wave behavior. From time to time, pairs of excitons are created. In each pair, one moves to the right and the other to the left. The right moving exciton of the left pair
Figure 3: Two excitons propagating along a ring. The left frame shows the firing in the x − t plane. The right frame shows m(x) (solid line) and f (x) (dash-dotted line) at a fixed time step ( f was rescaled by a factor of 3 for the purpose of illustration). The arrows designate the direction of propagation. Note that for fixed x, firing occurs when m = 1 turns into m = 0. For fixed t, the m = 0 region lies behind the moving exciton.
collides with the left moving exciton of the right pair, and they annihilate each other. That this has to be the case follows from the fact that firing occurs when an m = 1 neighborhood turns into an m = 0 one. In the continuum limit, m(x, t) forms continuous regions with values close to either 1 or 0, thus providing us with a natural topologic argument: the borderline in (x, t) can represent either a moving exciton or the creation or annihilation of a pair of excitons.

Figure 3 shows an example in which we close the line of neurons to form a ring; that is, we use periodic rather than open boundary conditions. In this example, we observe continuous motion of two excitons around the ring. The slope in the left frame is determined by the propagation velocity of the exciton. As in other models of neural excitation (e.g., Wilson & Cowan, 1973), the velocity depends on the interaction matrix Wij. In our model, stronger excitation leads to higher velocities (milder slopes), and the span of interactions determines the number of excitons that coexist.

4 Excitons in Two-Dimensional Space

Let us turn now to two-dimensional space and investigate a square grid of I & F neurons. We will use either DOG interactions, Wij = CE exp(−d²ij/dE) − CI exp(−d²ij/dI), where dij represents the Euclidean distance between two neurons, or short-range excitatory connections. As will be demonstrated below, we obtain solitary waves that are lines or curves. This is to be expected since they should occur at the boundaries of m = 0 and m = 1 regions, which, in two dimensions, are one-dimensional structures.

We find a variety of such firing patterns, depending on the span and strength of the interactions. Isolated arcs are a pattern typical of large-
Figure 4: Small arcs obtained in the case of strong inhibition. CE = 0.4, CI = 0.12, dE = 5, and dI = 40, restricted to a 20 × 20 area around each neuron on a 60 × 60 grid. The left frame shows the excitons in the form of small arcs that are fronts of moving m patches shown in the right frame. This figure, as well as all other two-dimensional simulations, uses the parameters k = 0.45, c = 0.35, α = −0.09, and I = 0.29.
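The DOG coupling quoted in the caption follows directly from the formula Wij = CE exp(−d²ij/dE) − CI exp(−d²ij/dI) given in section 4. The sketch below builds the dense matrix for a 60 × 60 grid and implements the 20 × 20 restriction as a hard cutoff, which is one plausible reading of the caption; it ignores memory efficiency.

import numpy as np

def dog_weights(n=60, CE=0.4, CI=0.12, dE=5.0, dI=40.0, window=10):
    """W_ij = CE exp(-d_ij^2 / dE) - CI exp(-d_ij^2 / dI), with d_ij the
    Euclidean distance between sites of an n x n grid, zeroed outside a
    (2*window) x (2*window) area around each neuron."""
    ys, xs = np.divmod(np.arange(n * n), n)     # grid coordinates of each unit
    dx = np.abs(xs[:, None] - xs[None, :])
    dy = np.abs(ys[:, None] - ys[None, :])
    d2 = dx**2 + dy**2
    W = CE * np.exp(-d2 / dE) - CI * np.exp(-d2 / dI)
    W[(dx > window) | (dy > window)] = 0.0      # restriction window
    return W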
Figure 5: A periodic pattern of activity obtained on a 90 × 90 spatial grid, using long-range interactions: CE = 0.2, CI = 0.02, dE = 15, and dI = 100.
range interactions with (relatively) strong inhibition. An example is shown in Figure 4. Less inhibition leads to larger structures, as we can see in Figure 5. The emerging pattern is periodic in time and consists of several propagating waves that interact. It is evident that the activity begins at the corners of the grid. This is due to the open boundary conditions. They cause the corner neurons to be the least inhibited, thus rising first and exciting their
Figure 6: Rotating spirals obtained on a 60 × 60 grid using DOG interactions: CE = 0.4, CI = 0.08, dE = 5, and dI = 40, restricted to a 20 × 20 area around each neuron. The left frame (a) appears prior to the right frame (b) in the simulation.
neighbors. Under similar interactions using periodic boundary conditions, we get propagating parallel stripes, the cortical patterns that Ermentrout and Cowan (1979) suggested as V1 excitations corresponding to the spiral retinal forms that appear in hallucinations.

Spirals and expanding rings are well-known solitary wave formations (Meron, 1992) that are also obtained here. The former are displayed in Figure 6 and the latter in Figure 7. Both formations are often encountered in two-dimensional arrays of I & F neurons (Jung & Mayer-Kress, 1995; Milton, Chu, & Cowan, 1993). The example of expanding rings is obtained by keeping only a few nearest neighbors in the interaction. The number of expanding rings is inversely related to the span of the interactions. In the example shown in Figure 7, two spontaneous foci of excitation form expanding rings, or target formations. They are displayed in three consecutive time steps. Note that firing exists only at the boundary between m = 0 and m = 1 areas. This property is responsible for the vanishing of two firing fronts that collide, because, after collision, there remains only a single m = 0 area, formed by the merger of the two former m = 0 areas.

5 The Topologic Constraint

The assumption that continuous neural fields can be employed leads to very strong constraints. m(x, t) is a continuous function on the D-dimensional manifold R ∋ x. Let us denote regions in which m ≥ 1/2 by S1 and regions where m < 1/2 by S0. Clearly S0 + S1 = R. We have seen in Figure 1 that, in practice, m is very close to 1 throughout S1, and very close to 0 throughout S0. We have discussed moving solutions of firing patterns, which we called excitons.
Figure 7: Expanding rings formed on a 60 × 60 grid. The top frames show the m fields at three consecutive time steps. The bottom frames show the corresponding coherent firing patterns. In this simulation we used excitatory interactions only, coupling each neuron with its eight neighbors with an amplitude of 0.3.
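The expanding-ring setting of Figure 7 needs only nearest-neighbor excitation, so a single update step of the grid is easy to sketch; the two-dimensional parameters are those quoted in the caption of Figure 4, the coupling amplitude is the 0.3 of Figure 7, and the Euler step size is an assumption.

import numpy as np

def step_grid(v, m, f, dt=0.05, k=0.45, alpha=-0.09, c=0.35, I=0.29, g=0.3):
    """One Euler step of the grid equations with each neuron excited by
    its eight neighbors with amplitude g (open boundary conditions)."""
    fp = np.pad(f, 1)                           # zero padding at the borders
    nbr = sum(np.roll(np.roll(fp, dy, 0), dx, 1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)
              if (dy, dx) != (0, 0))[1:-1, 1:-1]
    theta = (m - v >= 0).astype(float)
    v = v + dt * (-k * v + alpha + c * m * v + m * (I + g * nbr))
    m = m + dt * (-m + theta)
    f = m * (1 - m) * v
    return v, m, f

# Random initial v, all m = 1, as in the text:
# rng = np.random.default_rng(0)
# v, m, f = rng.uniform(size=(60, 60)), np.ones((60, 60)), np.zeros((60, 60))
# for _ in range(5000): v, m, f = step_grid(v, m, f)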
In all these solutions, the regions S1 and S0 change continuously with time. Since firing can occur only when a neuron switches from a state in S1 to one in S0, excitons are restricted to domains σ that lie on boundaries between these two regions. Hence, the dimension of σ has to be D − 1.

The situation is different for standing-wave solutions. These are solutions that vary with time, sometimes in a quite complicated fashion, but do not move around in space. They are high-order limit cycles. One encounters, then, a situation where a whole S1 region can switch into S0. This will lead to a spontaneous excitation over a region of dimension D. Starting with random initial conditions, we do not generate such solutions in general. However, for regular initial conditions, such as a checkerboard pattern of v = 1 and 0, such solutions emerge. The typical behavior in this case shows separation of space into two parts defined by the checkerboard pattern of initial conditions. While there is firing in one part, there is none in the other, and vice versa. The dynamics within each part can be rather complex and need not retain all the symmetry properties that were present
in the initial conditions. Since such solutions are not generated by random initial conditions, we conclude that they occupy a negligible fraction of the space of dynamical attractors.

The stability of a solution can be tested by seeing how it behaves in the presence of noise. Adding fluctuations to I, in both space and time, we can test the stability of the different solutions mentioned so far. The general pattern of behavior that we find is that for small perturbations, excitons change to a small extent, while standing-wave solutions disappear. This is to be expected in view of the fact that the standing-wave solutions required especially regular initial conditions.

Continuity of v and m is an outcome of the excitatory interactions of neurons with their neighborhoods. We have seen it in our simulations, for both the one- and two-dimensional manifolds. It is a reflection of a well-known property of clusters of I & F units: excitation without delays leads to coherence of firing (Mirollo & Strogatz, 1990). We may then argue that the topologic rule will hold for all I & F systems, even ones where the field m does not appear but v is reset to 0 after firing. The interactions lead to continuity of v. Then, even in the absence of m, there always exists a relative refractory period, when v is small—for example, v < ε—in which excitations of neighboring neurons are insufficient to drive v over the threshold in the next time step. In that case we can classify the manifold into S1 and S0 regions according to whether v is larger or smaller than ε. This leads to the same topologic consequences as the ones described above for our system.

6 Discussion

Refractoriness is a basic property of neurons. We have embodied it in our formulation in an explicit fashion that turned into a topologic constraint for a theory of I & F neural fields. The important consequence is that coherent firing patterns that are obtained as solitary waves in our theory have a dimension that is smaller by one unit than that of the manifold to which they are attached. Thus, we obtain pointlike excitons for neural fields on a line or ring and linear firing fronts on a two-dimensional surface.

The system that we discussed was always under the influence of some (usually constant) input I. Therefore, it formed a coupled oscillatory system. Alternatively, one may study a dissipative system of neural fields. This is an excitable medium that, in the absence of any input, will not generate excitons. However, given some initial conditions or some local input (Chu, Milton, & Cowan, 1994), it can generate target formations and spiral waves. The latter are fed by the strong interactions between the neurons. If the interactions are not strong enough, these excitons will die out. In the case of weak interactions, it helps to use noisy inputs (Jung & Mayer-Kress, 1995). The latter lead to background random activity that helps to maintain the coherent formation of spirals. Sometimes there is an isotropy-breaking element in the network, such as random connections or noise, that is responsible for
the abundance of spiral solutions. However, spirals can be obtained also in the absence of such elements, as is the case in our work. The details of what types of dynamic attractors are dominant depend on the type of network that one studies. Nonetheless, the refractory nature of I & F neurons guarantees that the topologic rule holds for all coherent phenomena. Once one arranges identical I & F neurons on a manifold with appropriate interactions, the continuity property of neural fields follows. These continuous fields lead to coherent neural firing of the type characterized by the excitons that are described in this article. Coherence is also maintained when we add synaptic delays that are proportional to the distance between neurons.

One may now ask what happens if the continuity is explicitly broken, for example, by strong noise in the input or by randomness in the synaptic connections. What we would expect in this case is that the DOG interactions specify the resulting behavior of the system. This is indeed the case, as demonstrated in Figure 8. The resulting firing behavior is quite irregular, yet it has a patchy character with a typical length scale that is of the order of the range of excitatory interactions. We believe that this is the explanation for the moving patches of activity reported by other authors. They are incoherent phenomena, emerging in models with randomly distributed radial connections, as in Hill and Villa (1994) and Usher et al. (1994). A coherent moving patch of firing activity is forbidden on account of the continuity requirement and refractoriness.²

We learn therefore that our model embodies two competing factors. The DOG interactions tend to produce patchy firing patterns, but the ensuing coherence of identical neurons leads to the formation of one-dimensional excitons on a two-dimensional manifold. If, however, strong fluctuations exist—the neurons can no longer be described by homogeneous physiological and geometrical properties—the resulting patterns of firing activity will be determined by the form of the interactions.

Abbott and van Vreeswijk (1993) have studied the conditions under which an I & F model with all-to-all coupling allows for stable asynchronous solutions. They concluded that if the interactions are instantaneous, the asynchronous state is unstable in the absence of noise. In other words, under these conditions, the system is synchronous. This obviously agrees with our results. However, they found that when the synaptic interactions have various temporal structures, asynchronous states can be stable. We, on the other hand, continue to obtain coherent solutions when we introduce delays in the synaptic interactions where the delays are proportional to distance. Clearly these two I & F models differ in both their geometrical and temporal structure. Our conclusions are limited to homogeneous I & F structures with

² A possible exception is the case of bursting neurons. In this case, sustained activity can be maintained for a short while even when the potential v reaches threshold. Hence, the firing fronts can acquire some width. The effect depends on the relation between the duration of the burst and the velocity of the exciton.
Figure 8: Incoherent firing patterns for (a) high variability of synaptic connections or (b) noisy input. In (a) we multiplied 75 percent of all synapses by a random gaussian component (mean = 1., S.D. = 3.) that stays constant in time. In (b) we employed a noisy input that varies in space and time (mean = 0.29, S.D. = 0.25). In both frames, the firing patterns are no longer excitons. We can see the formation of small clusters of firing neurons. The typical length of these patches is of the order of the span of excitatory interactions. This is a manifestation of the dominance of interactions in determining the spatial behavior in the absence of continuity that imposes the topologic constraint.
suitable interactions that lead to coherent behavior. Are there situations where coherent firing activity exists in neuronal tissue? If the explanation of hallucinatory phenomena (Ermentrout & Cowan, 1979) is correct, then this is expected to be the case. It could be proved experimentally through optical imaging of V1 under appropriate pharmacological conditions. Other abnormal brain activities, such as epileptic seizures, could also fall into the category of coherent firing patterns.

Does coherence occur also under normal functioning conditions? Arieli, Sterkin, Grinvald, and Aertsen (1996) have reported interesting spatiotemporal evoked activity in areas 17 and 18 in the cat. Do the underlying neurons fire coherently? Presumably the thalamocortical spindle waves that might be generated by the reticular thalamic nucleus (Golomb, Wang, & Rinzel, 1994; Contreras & Steriade, 1996) can be a good example of coherent activity. Another example could be the synchronous bursts of activity that propagate as wave fronts in retinal ganglion cells of neonatal mammals (Meister, Wong, Denis, & Shatz, 1991; Wong, 1993). It has been suggested that these waves play an important role in the formation of ocular dominance layers in the lateral geniculate nucleus (Meister et al., 1991). It would be interesting to have a systematic study of neuronal tissue on the mm scale in different areas and under different conditions, to learn if and when nature makes use of continuous neural fields.
References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cyb., 27, 77–87.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked responses. Science, 273, 1868–1871.
Chu, P. S., Milton, J. G., & Cowan, J. D. (1994). Connectivity and the dynamics of integrate-and-fire neural networks. Int. J. of Bifurcation and Chaos, 4, 237–243.
Contreras, D., & Steriade, M. (1996). Spindle oscillation in cats: The role of corticothalamic feedback in a thalamically generated rhythm. J. of Physiology, 490, 159–179.
Cowan, J. D. (1985). What do drug induced visual hallucinations tell us about the brain? In W. B. Levy, J. A. Anderson, & S. Lehmkuhle (Eds.), Synaptic modification, neuron selectivity, and nervous system organization (pp. 223–241). Hillsdale, NJ: Lawrence Erlbaum.
Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cyb., 34, 136–150.
FitzHugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophy. J., 1, 445–466.
Fohlmeister, C., Gerstner, W., Ritz, R., & van Hemmen, J. L. (1995). Spontaneous excitation in the visual cortex: Stripes, spirals, rings, and collective bursts. Neural Comp., 7, 905–914.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of spiking neurons. Network, 3, 139–164.
Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. of Neurophys., 72, 1109–1126.
Hill, S. L., & Villa, A. E. P. (1994). Global spatio-temporal activity influenced by local kinetics in a simulated "cortical" neural network. In H. J. Herrmann, D. E. Wolf, & E. Pöppel (Eds.), Supercomputing in brain research: From tomography to neural networks. Singapore: World Scientific.
Jung, P., & Mayer-Kress, G. (1995). Noise controlled spiral growth in excitable media. Chaos, 5, 458–462.
Klüver, H. (1967). Mescal and mechanisms of hallucinations. Chicago: University of Chicago Press.
Marlin, S. G., Douglas, R. M., & Cynader, M. S. (1991). Position-specific adaptation in simple cell receptive fields of the cat striate cortex. Journal of Neurophysiology, 66(5), 1769–1784.
Meron, E. (1992). Pattern formation in excitable media. Physics Reports, 218, 1–66.
Meister, M., Wong, R. O. L., Denis, A. B., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943.
Milton, J. G., Chu, P. H., & Cowan, J. D. (1993). Spiral waves in integrate-and-fire neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1001–1007). San Mateo, CA: Morgan Kaufmann.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM Jour. of App. Math., 50, 1645–1662.
Newell, A. C. (1985). Solitons in mathematics and physics. Philadelphia: Society for Industrial and Applied Mathematics.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Wong, R. O. L. (1993). The role of spatio-temporal firing patterns in neuronal development of sensory systems. Curr. Op. Neurobio., 3, 595–601.
Wörgötter, F., Niebur, E., & Koch, C. (1991). Isotropic connections generate functional asymmetrical behavior in visual cortical cells. J. Neurophysiol., 66(2), 444–459.

Received July 11, 1996; accepted February 14, 1997.
Communicated by Steven Nowlan
Time-Series Segmentation Using Predictive Modular Neural Networks Athanasios Kehagias Department of Electrical Engineering, Aristotle University of Thessaloniki, GR 54006, Thessaloniki, Greece, and Department of Mathematics, American College of Thessaloniki, GR 55510 Pylea, Thessaloniki, Greece
Vassilios Petridis Department of Electrical Engineering, Aristotle University of Thessaloniki, GR 54006, Thessaloniki, Greece
A predictive modular neural network method is applied to the problem of unsupervised time-series segmentation. The method consists of the concurrent application of two algorithms: one for source identification, the other for time-series classification. The source identification algorithm discovers the sources generating the time series, assigns data to each source, and trains one predictor for each source. The classification algorithm recursively computes a credit function for each source, based on the competition of the respective predictors according to their predictive accuracy; the credit function is used for classification of the time-series observation at each time step. The method is tested by numerical experiments.

1 Introduction

For a given time series, the segmentation process should result in a number of models, each of which best describes the observed behavior for a particular segment of the time series. This can be formulated in a more exact manner as follows. A time series yt, t = 1, 2, . . . is generated by a source S(zt), where zt is a time-varying source parameter, taking values in a finite parameter set Θ = {θ1, θ2, . . . , θK}. At time t, yt depends on the value of zt, as well as on yt−1, yt−2, . . . However, the nature of this dependence—the number K of parameter values as well as the actual values θ1, θ2, . . . , θK—is unknown. What is required is to find the value of zt for t = 1, 2, . . . or, in other words, to classify yt for t = 1, 2, . . . Typical examples of this problem include speech recognition (Rabiner, 1988), radar signal classification (Haykin & Cong, 1991), and electroencephalogram processing (Zetterberg et al., 1981); other applications are listed in Hertz, Krogh, and Palmer (1991).

Under the assumptions that (1) the parameter set Θ = {θ1, θ2, . . . , θK} is known in advance and (2) a predictor has been trained on source-specific
Neural Computation 9, 1691–1709 (1997)
© 1997 Massachusetts Institute of Technology
data before the classification process starts, we have presented a general framework for solving this type of problem (Petridis & Kehagias, 1996a, 1996b; Kehagias & Petridis, 1997), using predictive modular neural networks (PREMONNs). The basic feature of the PREMONN approach is the use of a decision module and a bank of prediction modules. Each prediction module is tuned to a particular source S(θk) and computes the respective prediction error; the decision module uses the prediction errors recursively to compute certain quantities (one for each source), which can be understood as Bayesian posterior probabilities (probabilistic interpretation) or simply as credit functions (phenomenological interpretation). Classification is performed by maximization of the credit function. Similar approaches have been proposed in the control and estimation literature (Hilborn & Lainiotis, 1969; Sims et al., 1969; Lainiotis, 1971) and in the context of predictive hidden Markov models (Kenny, Lennig, & Mermelstein, 1990). These approaches also assume the sources to be known in advance.

On the other hand, several recent articles (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994; Jordan & Xu, 1995; Xu & Jordan, 1996) use a similar approach but provide a mechanism for unsupervised source specification. The term mixtures of experts is used in these articles; each expert is a neural network specialized in learning the required input-output relationship for a particular region of the pattern space. There is also a gating network that combines the output of each expert to produce a better approximation of the desired input-output relationship. The activation pattern of the local experts is the output of the gating network, and it can be considered as a classifier output: when a particular expert is activated, this indicates that the current observation lies in a particular section of the pattern space.

The emphasis in the mixture-of-experts method is on static pattern classification rather than classification of dynamic patterns such as time series. However, there are some applications of the mixture-of-experts method to dynamic problems of time-series classification. For example, in Prank et al. (1996), the mixture-of-experts method is applied to a problem of classifying the "pulsatile" pattern of growth hormone in healthy subjects and patients. The goal is to discriminate the pulsatile pattern of healthy subjects from that of patients. In general, the time series corresponding to a healthy subject is more predictable than that corresponding to a patient. Classification is based on the level of prediction error for a particular time series (low error indicating a healthy subject and high error indicating a patient). The selection pattern of local experts (which expert is activated at any given time step) can also be used for classification, since the frequency distribution of most probable experts differs significantly between patients and healthy subjects. However, the activation of experts by the gating network is performed at every time step independent of previous activation patterns. This may result in the loss of important correlations across time steps, which characterize the dynamics of the time series.
This static approach to time-series classification may be problematic in cases where there is significant correlation between successive time steps of the time series. This has been recognized and discussed by Cacciatore and Nowlan (1994), who state that "the original formulation of this [local experts] architecture was based on an assumption of statistical independence of training pairs. . . . This assumption is inappropriate for modelling the causal dependencies in control tasks." Cacciatore and Nowlan proceed to develop the mixture-of-controllers architecture, which emphasizes "the importance of using recurrence in the gating network." The mixture of controllers is similar to the method we present in this article. However, the method presented by Cacciatore and Nowlan requires several passes through the training data; our method, on the other hand, can be operated online because it does not require multiple passes. In addition, in some cases, Cacciatore and Nowlan need to provide the active sources explicitly (for some portion of the training process) for the mixture-of-controllers network to accomplish the required control task; in our case, learning is completely unsupervised, and the active sources always remain hidden.

Pawelzik, Kohlmorgen, and Müller (1996) present an approach using annealed competition of experts. The method they present is similar to the work cited earlier; it is applied to time-series segmentation and does not assume the sources to be known in advance. This method is mainly suitable for offline implementation due to the heavy computational load incurred by the annealing process.

In this article we present an unsupervised time-series segmentation scheme, which combines two algorithms: (1) the source identification algorithm, which detects the number of active sources in the time series, performs assignment of data to the sources, and uses the data to train one predictor for each active source, and (2) the classification algorithm, which uses the sources and predictors specified by the identification algorithm and implements a PREMONN, which computes credit functions. A probabilistic (Bayesian) interpretation of this algorithm is possible but not necessary. The combination of the two algorithms—the source identification one and the time-series classification one—constitutes the combined time-series segmentation scheme; the two algorithms can be performed concurrently and online. A crucial assumption is that of slow switching between sources. Our approach is characterized by the use of a modular architecture, with predictive modules competing for data and credit assignment. A decision module is also used, to assign the credit in a recursive manner that takes into account previous classification decisions. A convergence analysis of the classification algorithm is available elsewhere (Kehagias & Petridis, 1997).

2 The Time-Series Segmentation Problem

Consider a time series zt, t = 1, 2, . . . , taking values in the unknown set Θ = {θ1, . . . , θK}. Further consider a time series yt, t = 1, 2, . . . that is generated
by some source S(zt): for every value of zt a source S(zt) is activated, which produces yt using yt−1, yt−2, . . . We call zt the source parameter and yt the observation. For instance, if zt = θ1, we could have yt = f(yt−1, yt−2, . . . ; θ1), where f(·) is an appropriate function with parameter θ1; θ1 can be a scalar or vector quantity or, simply, a label variable. The task of time-series segmentation consists in estimating the values z1, z2, . . . , based on the observations y1, y2, . . . The size K of the source set Θ and its members θ1, θ2, . . . , θK are initially unknown.

This original task can be separated into two subtasks: source identification and time-series classification. The task of source identification consists of determining the active sources or, equivalently, the source set Θ = {θ1, . . . , θK}. Having achieved this, the time-series classification task consists of assigning each zt (t = 1, 2, . . . ) to a value θk (k = 1, 2, . . . , K) or, equivalently, determining at each time step t the source S(θk) that has generated yt. The combination of the two subtasks yields the original time-series segmentation task.

Two algorithms are presented here, one to solve each subtask. The source identification algorithm distributes data y1, y2, . . . , yt among active sources and uses the data to train one predictor per source. For instance, a data segment y1, y2, . . . , yM may be assigned to source S(θ1) and used to approximate some postulated relationship yt = f(yt−1, yt−2, . . . ; θ1).¹ The time-series classification algorithm uses the predictors trained in the source identification phase to compute predictions of incoming observations. The respective prediction errors are used to compute a credit function for each predictor or source; an observation yt is classified to the source of maximum credit.

¹ yt = f(yt−1, yt−2, . . . ; zt) reflects the dependence of yt on past observations, as well as on the parameter zt. In general, the true input-output relationship will be unknown and f(·) will be an approximation, such as furnished by a sigmoid neural network. In fact, due to the competitive nature of the algorithms presented, even a crude approximation suffices, as will be explained later.

Each of the two algorithms has been developed using a different approach. The source identification algorithm has a heuristic motivation; its justification is experimental. On the other hand, the classification algorithm is motivated by probabilistic considerations. Although in this article its performance is evaluated only experimentally, a detailed theoretical analysis is available (Kehagias & Petridis, 1997).
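As a toy instance of this setup, one can generate observations from two hypothetical autoregressive sources under the slow-switching assumption; the AR coefficients, noise level, and switch period below are invented for illustration.

import numpy as np

def switching_series(T=3000, period=200, seed=0):
    """y_t generated by one of two AR(1) sources, with the source
    parameter z_t switching every `period` steps (slow switching)."""
    rng = np.random.default_rng(seed)
    coeff = {0: 0.9, 1: -0.7}                  # stand-ins for theta_1, theta_2
    y = np.zeros(T)
    z = np.zeros(T, dtype=int)
    for t in range(1, T):
        z[t] = (t // period) % 2               # active source at time t
        y[t] = coeff[z[t]] * y[t - 1] + 0.1 * rng.standard_normal()
    return y, z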
3 The Time-Series Segmentation Scheme

We first present the two algorithms that constitute the time-series segmentation scheme, then describe the manner in which they are combined.

3.1 The Source Identification Algorithm. Other authors have tackled the same problem using similar methods (Pawelzik et al., 1996). There are also connections to k-means (Duda & Hart, 1973) and Kohonen's learning vector quantizer (Kohonen, 1988); note, however, that these have been applied mainly to classification of static patterns. We now give a description of the algorithm. The following parameters will be used: L, the length of a data block; M, the length of a data segment, where M < L; K, the number of active predictors; and Kmax, the maximum allowed number of active predictors.

Source Identification Algorithm.

Initialization. Set K = 1. Use data block y1, y2, . . . , yL to train the first predictor, of the form yt1 = f(yt−1, yt−2, . . . ; θ1), where f(·) is a (sigmoid) neural network and θ1 is taken to be the connection weights.

Main Routine. Loop for n = 1, 2, . . .
1. Split data block yt, t = n · L + 1, n · L + 2, . . . , (n + 1) · L (that is, L time steps) into L/M data segments (of M time steps each, M < L).

2. Loop for k = 1, . . . , K.
   (a) Use each segment of yt's as input to the kth predictor to compute a prediction ytk = f(yt−1, . . . ; θk) for each time t.
   (b) For every segment, compare the predictions ytk with the actual observations yt (t = n · L + 1, n · L + 2, . . . , (n + 1) · L) to compute the segment-average square prediction error (henceforth SASPE).
   End of k loop.

3. Loop for every segment in the block (a total of L/M segments).
   (a) For the current segment, find the predictor that produces the minimum SASPE.
   (b) For this predictor, compare the minimum SASPE with a respective threshold.²
   (c) In case SASPE is above the threshold and K < Kmax, set K = K + 1 and assign the current segment to a new predictor; otherwise, assign the segment to the predictor of minimum SASPE.
   End of segment loop.

4. Loop for k = 1, . . . , K.
² This threshold is computed separately for each predictor.
   (a) Retrain the kth predictor for J iterations, using the L/M most current data segments assigned to it.
   End of k loop.
End of n loop.

It is assumed that the switching rate is slow enough so that the initial segment will mostly contain data from one source; however, some data from other sources may also be present. The modular and competitive nature of the algorithm is evident. A data segment is assigned to a predictor even if the respective prediction is poor; what matters is that it is better than the remaining ones. Hence, every predictor is assigned data segments on the assumption that they were generated by a particular source, and the predictor is trained (specialized) on this source. Training can be performed independently, in parallel, for each predictor. Numerical experimentation indicates that the values of several parameters may affect performance:

1. L is the length of a data block. If L is too large, then training for each data block requires a long time. If L is too small, then not enough data are available for the retraining of the predictors. For the experiments reported in section 4, we have found that values around 100 are sufficient to ensure good training of the predictors without incurring excessive computational load.

2. M is the length of a data segment; it is related to the switching rate of z_t. Let Ts denote the minimum number of time steps between two successive source switches. While Ts is unknown, we operate on the assumption of slow switching, which means that Ts will be relatively large. Since all the M data points included in a segment will be assigned to the same source (and used to train the same predictor), it is obviously desirable that they have been truly generated by the same source; otherwise the respective predictor will not specialize on one source. In reality, however, this is not guaranteed to happen. In general, a small value of M will increase the likelihood that most segments contain data from a single source. On the other hand, it has been found that small M leads to an essentially random assignment of data points to sources, especially in the initial stages of segmentation, when the predictors have not specialized sufficiently. The converse situation holds for large M. In practice, one needs to guess a value for Ts and then take M somewhere between 1 and Ts. This choice is consistent with the hypothesis of slow switching rate. The practical result is that most segments in a block contain data from exactly one source, and a few segments contain data from two sources. The exact value of Ts is not required to be known; a rough guess suffices.
3. J, the number of training iterations, should be taken relatively small, since the predictors must not be overspecialized in the early phases of the algorithm, when relatively few data points are available.

4. The error threshold is used to determine if and when a new predictor should be added. At every retraining epoch, the training prediction error, E_k, is computed for k = 1, 2, . . . , K. Then the threshold, D_k, is set equal to a·E_k, where a is a scale factor. A high a value makes the introduction of new predictors difficult, since large prediction errors are tolerated; conversely, a low a value facilitates the introduction of new predictors.

5. In practice, a value Kmax is set to be the maximum number of predictors used. As long as this is equal to or larger than the number of truly active sources, it does not affect the performance of the algorithm seriously.

A further quantity of interest is T, the length of the time series. Theoretically this is infinite, since the algorithm is online, with new data becoming continuously available. However, it is interesting to note the minimum number of data required for the algorithm to perform the source identification successfully. In the experiments presented in section 4, it is usually between 1000 and 3000. This is the information that the algorithm requires to discover the number of active sources and to train respective predictors. In addition, the time required to process the data is short, because only a few training iterations are performed per data segment; these computations are performed while waiting for the new data segment to be accumulated. The short training time and the small amount of data required make the source identification algorithm very suitable for online implementation.

3.2 The Classification Algorithm. The classification algorithm has already been presented in Petridis and Kehagias (1996b) and Kehagias and Petridis (1997) under the name of predictive modular neural network (PREMONN). A brief summary is given here. It is assumed that the source parameter time series z_t, t = 1, 2, . . . , takes values in the known set Θ = {θ_1, . . . , θ_K} (this has been specified by the source identification algorithm). Corresponding to each source S(θ_k), there is a predictor (e.g., a neural network trained by the source identification algorithm) characterized by the following equation (k = 1, 2, . . . , K): y_t^k = f(y_{t−1}, y_{t−2}, . . . ; θ_k).
(3.1)
At every time step t, the classification algorithm is required to identify the member of Θ that best matches the observation y_t. Using a probabilistic approach, one can define p_t^k = Pr(z_t = θ_k | y_1, y_2, . . . , y_t). In Petridis and
Kehagias (1996a) the following p_t^k update is obtained:

p_t^k = [ p_{t−1}^k · e^{−|y_t − y_t^k|²/σ²} ] / [ ∑_{l=1}^K p_{t−1}^l · e^{−|y_t − y_t^l|²/σ²} ].   (3.2)
Equation (3.2) is a recursive implementation of Bayes' rule. The recursion is started using p_0^k, the prior probability assigned to θ_k at time t = 0 (before any observations are available). In the absence of any prior information, all models are assumed equally credible:

p_0^k = 1/K for k = 1, 2, . . . , K.   (3.3)
Finally, ẑ_t (the estimate of z_t) is chosen as

ẑ_t = arg max_{θ_k ∈ Θ} p_t^k.   (3.4)
Hence the PREMONN classification algorithm is summarized by equations 3.1 through 3.4. This is a recursive implementation of Bayes' rule; the value p_t^k reflects our belief (at time t) that the observations are produced by a model with parameter value θ_k. The probabilistic interpretation of the algorithm, presented above, is not necessary. A different, phenomenological, interpretation goes as follows. p_t^k is a credit function that evaluates the performance of the kth source in predicting the observations y_1, y_2, . . . , y_t. This evaluation depends on two factors: e^{−|y_t − y_t^k|²/σ²} reflects how accurately the current observation was predicted, and p_{t−1}^k reflects past performance. Hence equation 3.2 updates the credit function p_t^k recursively. The update takes into account the previous value p_{t−1}^k; this makes p_t^k implicitly dependent on the entire time-series history. The final effect is that correlations between successive observations are captured. Note also the competitive nature of classification, which is evident in equation 3.2: even if a predictor performs poorly, it may still receive high credit as long as the remaining predictors perform even worse. Kehagias and Petridis (1997) proved that equation 3.2 converges to appropriate values that ensure classification to the most accurate predictor. In the case of slow switching (as assumed in this article), convergence to the active source is guaranteed for the time interval between two source switchings, provided that the convergence rate is fast, compared to the switching rate. Again, this requirement is in accordance with the hypothesis of slow switching. The performance of the classification algorithm is influenced by two parameters, σ and h.
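The update is simple to implement. The following is a minimal Python sketch of equations 3.2 through 3.4 (the article's experiments used MATLAB; the function name and the renormalization after flooring the credits at h are our own choices, and σ and h are the two parameters discussed next):

    import numpy as np

    def premonn_step(y_t, y_pred, p_prev, sigma, h=0.01):
        # y_pred[k] is the kth predictor's output y_t^k; p_prev is p_{t-1}.
        w = p_prev * np.exp(-np.abs(y_t - y_pred) ** 2 / sigma ** 2)
        p = w / w.sum()                 # equation 3.2 (recursive Bayes' rule)
        p = np.maximum(p, h)            # credit floor; see the threshold h below
        p = p / p.sum()                 # renormalize after flooring (our choice)
        return p, int(np.argmax(p))     # equation 3.4: estimated source index

The adaptive σ estimate given below can be computed at each step as sigma = np.sqrt(np.mean(np.abs(y_t - y_pred) ** 2)).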
σ is a smoothing parameter. Under the probabilistic interpretation, σ is the standard deviation of the prediction error. A phenomenological interpretation is also possible. For a fixed error |y_t − y_t^k|, note that a large σ results in a large value of e^{−|y_t − y_t^k|²/σ²}. This is of interest particularly in the case where a predictor performs better than the rest over a large time interval but occasionally generates a bad prediction (perhaps due to random fluctuation). In this case, a very small σ will result in a very small value of e^{−|y_t − y_t^k|²/σ²} in equation 3.2. This, in turn, will decrease p_t^k excessively. On the other hand, a very large σ results in sluggish performance (a long series of large prediction errors is required before p_t^k is sufficiently decreased). For the experiments reported in this article, σ is determined at every time step t by the following adaptive estimate:

σ = √( ∑_{k=1}^K |y_t − y_t^k|² / K ).
A threshold parameter h is also used in the classification algorithm. Suppose that z_t remains fixed for a long time at, say, θ_1; then at time t_s, a switch to θ_2 occurs. For t < t_s, e^{−|y_t − y_t^1|²/σ²} is large and e^{−|y_t − y_t^2|²/σ²} is small (much less than one). Considering equation 3.2, one sees that p_t^2 decreases exponentially; after a few steps it may become so small that computer underflow takes place, and p_t^2 is set to 0. It follows from equation 3.2 that after that point, p_t^2 will remain 0 for all subsequent time steps; in particular, this will also be true after t_s, despite the fact that now the second predictor performs well (e^{−|y_t − y_t^2|²/σ²} will be large, but multiplied by p_{t−1}^2 it will still yield p_t^2 = 0). Hence the currently best predictor receives zero credit and is excluded from the classification process. What is required is to ensure that p_t^2 will never become too small. Hence the following strategy is used: whenever p_t^k falls below h, it is reset to h; if h is chosen small (say around 0.01), then this resetting does not influence the classification process, but it also gives predictors with low credit the chance to recover quickly, once the respective source is activated.

3.3 The Time-Series Segmentation Scheme. The time-series segmentation scheme consists of the concurrent application of the two algorithms. The classification algorithm is run on the current data block, using the predictors discovered and trained by the source identification algorithm in the previous epoch³ (operating on the previous data block). Concurrently, the identification algorithm operates on the current block and provides an update of the predictors, but this update will be used only in the next epoch
³ An epoch is defined as the interval of L time steps that corresponds to a data block.
of the segmentation scheme. Hence, the classification algorithm uses information provided by the source identification algorithm, but not vice versa. The identification algorithm in effect provides a crude classification by assigning segments to predictors; the role of the classification algorithm is to provide segmentation at the level of individual time steps. Comparatively speaking, in the initial stages of the segmentation process, the identification algorithm is more important than the classification algorithm and the predictors change significantly at every epoch; at a later stage, the predictors reach a steady state, and the source identification algorithm becomes less important. In fact, a reduced version of the segmentation scheme could use only the identification algorithm initially (until the predictors are fairly well defined) and only the classification algorithm later (when the predictors are not expected to change significantly). However, for this reduced version to work, it is required that new sources are not activated at a later stage of the time-series evolution.

4 Experiments

Two sets of experiments are presented in this section to evaluate the segmentation scheme.

4.1 Four Chaotic Time Series. Consider four sources, each generating a chaotic time series:

1. For z_t = 1, a logistic time series of the form y_t = f_1(y_{t−1}), where f_1(x) = 4x(1 − x).
2. For z_t = 2, a tent map time series of the form y_t = f_2(y_{t−1}), where f_2(x) = 2x if x ∈ [0, 0.5) and f_2(x) = 2(1 − x) if x ∈ [0.5, 1].
3. For z_t = 3, a double logistic time series of the form y_t = f_3(y_{t−1}) = f_1(f_1(y_{t−1})).
4. For z_t = 4, a double tent map time series of the form y_t = f_4(y_{t−1}) = f_2(f_2(y_{t−1})).

The four sources are activated consecutively, each for 100 time steps, giving an overall period of 400 time steps. Ten such periods are used, resulting in a 4000-step time series. The task is to discover the four sources and the switching schedule by which they are activated. Six hundred time steps of the composite time series (encompassing the source transition 1 → 2 → 3 → 4 → 1 → 2) are presented in Figure 1. Hence the first hundred time steps correspond to z_t = 1; the second hundred time steps correspond to z_t = 2; and so on. A number of experiments are performed using several different combinations of block and segment lengths; also the time series is observed at various levels of noise—that is, at every step, y_t is mixed with additive white
Figure 1: Six hundred steps of a composite time series produced by four chaotic sources (logistic, tent map, double logistic, and double tent map) being activated alternately, for 100 time steps each: steps 1 to 100 correspond to source 1, steps 101 to 200 correspond to source 2, steps 201 to 300 correspond to source 3, steps 301 to 400 correspond to source 4, steps 401 to 500 correspond to source 1, steps 501 to 600 correspond to source 2.
noise uniformly distributed in the interval [−A/2, A/2]. Hence an experiment is determined by a combination of L, M, and A values. The predictors used are 1-5-1 sigmoid neural networks; the maximum number of predictors is Kmax = 6; the number of training iterations per time step is J = 5; the credit threshold is set at h = 0.01.⁴ In every experiment performed, all four sources are eventually identified. This takes place at some time Tc, which is different for every experiment. After time Tc, a classification figure of merit is computed. It is denoted by c = T_2/T_1, where T_1 is the total number of time steps after Tc, and T_2 is the number of correctly classified time steps after Tc. The results of the experiments (Tc and c) are listed in Tables 1 and 2. The evolution of a representative credit function, in this case p_t^3 (the remaining five credit functions evolve in a similar manner), is presented in Figure 2A. This figure was obtained from a noise-free experiment (A = 0.0), with L = 105, M = 15. Note that classification beyond time Tc is very stable. The classification decisions (namely the source and predictor selected at every time step) for the same experiment are presented in Figure 2B. It can be observed that

⁴ For training the sigmoid predictors, a Levenberg-Marquardt algorithm was used; this was implemented by Magnus Norgaard, Technical University of Denmark, and distributed as part of the Neural Network System Identification Toolbox for MATLAB.
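For concreteness, the composite chaotic series just described can be generated as in the following Python sketch. The article does not specify whether the noise feeds back into the chaotic iteration or how the state is carried across source switches; here the iteration runs on the clean state, the noise affects only the observations, and the state simply carries over at switches:

    import numpy as np

    def logistic(x):
        return 4.0 * x * (1.0 - x)

    def tent(x):
        return 2.0 * x if x < 0.5 else 2.0 * (1.0 - x)

    SOURCES = [logistic, tent,
               lambda x: logistic(logistic(x)),   # double logistic
               lambda x: tent(tent(x))]           # double tent map

    def composite_series(n_steps=4000, period=100, noise_a=0.0, seed=0):
        rng = np.random.default_rng(seed)
        y = np.empty(n_steps)
        x = rng.uniform(0.1, 0.9)                 # state of the iteration
        for t in range(n_steps):
            z = (t // period) % len(SOURCES)      # active source, 100 steps each
            x = SOURCES[z](x)
            y[t] = x + rng.uniform(-noise_a / 2, noise_a / 2)
        return y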
Table 1: Four Chaotic Time Series, with Switching Period 100.

         L = 70, M = 10    L = 100, M = 10   L = 120, M = 10
  A      Tc       c        Tc       c        Tc       c
  0.00    500   0.982      1200   0.976      3000   0.963
  0.05   1600   0.969      1100   0.968      3200   0.899
  0.10   1800   0.939      1100   0.962      1300   0.898
  0.15    800   0.947      1600   0.934      1400   0.767
  0.20   2200   0.529      3900   0.899      2200   0.359

Notes: Time-series length, 4000 points. Segment length M = 10, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
Table 2: Four Chaotic Time Series, with Switching Period 100.

         L = 75, M = 15    L = 105, M = 15   L = 120, M = 15
  A      Tc       c        Tc       c        Tc       c
  0.00    900   0.976      1200   0.983      1600   0.980
  0.05   1600   0.971      1700   0.967      1600   0.963
  0.10   1600   0.954      2800   0.973      1800   0.819
  0.15   1700   0.943      2600   0.927      2200   0.613
  0.20   1600   0.130      2200   0.416      2800   0.708

Notes: Time-series length, 4000 points. Segment length M = 15, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
the four sources correspond to predictors 1, 2, 3, and 6. Predictors 4 and 5 have received relatively few data, have been undertrained, and do not correspond to any of the truly active sources; generally these predictors play no significant role in the segmentation process.

4.2 Mackey-Glass Time Series. Now consider a time series obtained from three sources of the Mackey-Glass type. The original data evolve in continuous time and satisfy the differential equation:

dy/dt = −0.1 y(t) + 0.2 y(t − t_d) / (1 + y(t − t_d)^10).   (4.1)
For each source a different value of the delay parameter t_d was used: t_d = 17, 23, and 30. The time series is sampled in discrete time, at a sampling rate τ = 6, with the three sources being activated alternately, for 100 time steps each. The final result is a time series with a switching period of 300
Figure 2: A time-series segmentation experiment with four chaotic time series, L = 105, M = 15, A = 0.00 (noise-free time series). (A.) Evolution of credit function p_t^3. (B.) Classification decisions. Note that all four sources have been identified at time Tc = 1200. The corresponding predictors are 1, 2, 3, and 6. After Tc = 1200, all classifications are made to these four predictors or sources; predictors 4 and 5 play practically no role in the classification process.
and a total length of 4000 time steps. Six hundred time steps of this final time series (encompassing the source transition 1 → 2 → 3 → 1 → 2 → 3) are presented in Figure 3. Hence, the first hundred time steps correspond to zt = 1; the second hundred time steps correspond to zt = 2; and so on.
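The Mackey-Glass segments can be generated numerically; the sketch below uses a simple Euler discretization of equation 4.1 (the integration step dt, the constant initial history y0, and the warm-up are our assumptions — the article specifies only t_d, the sampling rate τ = 6, and the switching schedule):

    import numpy as np

    def mackey_glass(n_samples, td=17, tau=6, dt=0.1, y0=1.2):
        """Euler integration of equation 4.1, sampled every tau time units."""
        steps_per_sample = int(round(tau / dt))
        delay = int(round(td / dt))
        n_steps = n_samples * steps_per_sample + delay
        y = np.full(n_steps, y0)                 # constant initial history
        for i in range(delay, n_steps - 1):
            y_del = y[i - delay]
            y[i + 1] = y[i] + dt * (-0.1 * y[i] + 0.2 * y_del / (1.0 + y_del ** 10))
        return y[delay::steps_per_sample][:n_samples]

Concatenating 100-sample segments with td = 17, 23, and 30 in rotation gives a composite series of the kind shown in Figure 3; the values land roughly in the [0.7, 1.3] range noted in the discussion below.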
Figure 3: Six hundred steps of a composite time series produced by three Mackey-Glass sources (with t_d equal to 17, 23, and 30, respectively) being activated alternately, for 100 time steps each: steps 1 to 100 correspond to source 1, steps 101 to 200 correspond to source 2, steps 201 to 300 correspond to source 3, steps 301 to 400 correspond to source 1, steps 401 to 500 correspond to source 2, steps 501 to 600 correspond to source 3.
Again, a particular experiment is determined by a combination of L, M, and A values. The predictors used are 5-5-1 sigmoid neural networks; the maximum number of predictors is Kmax = 6; the number of training iterations per time step is J = 5; the credit threshold is set at h = 0.01. Again, all three sources were correctly identified in every experiment. The results of the experiments, Tc and c (described in section 4.1), are listed in Tables 3 and 4. The evolution of three representative credit functions, p_t^1, p_t^2, p_t^3, is presented in Figures 4A-C, respectively. This figure was obtained from an experiment with A = 0.15, L = 105, M = 15. Note that classification beyond time Tc is quite stable. The classification decisions (namely the source and predictor selected at every time step) for the same experiment are presented in Figure 4D. It can be observed that the three sources correspond to predictors 1, 2, and 3. Predictors 4, 5, and 6 have received relatively few data, have been undertrained, and do not correspond to any of the truly active sources; generally these predictors play no significant role in the segmentation process.

4.3 Discussion. It can be observed from Tables 1 through 4 that the segmentation scheme takes between 1000 and 3000 steps to discover the active sources. After that point, classification is quite accurate for low to middle
Table 3: Three Mackey-Glass Time Series, with Switching Period 100.

         L = 70, M = 10    L = 100, M = 10   L = 120, M = 10
  A      Tc       c        Tc       c        Tc       c
  0.00   1700   0.978      1400   0.976      3000   0.925
  0.05   1100   0.977      1200   0.975      2200   0.977
  0.10   1300   0.853      2700   0.964       800   0.943
  0.15   3500   0.935      1300   0.796      1300   0.945
  0.20   1200   0.664      1000   0.637      1200   0.839

Notes: Time-series length, 4000 points. Segment length M = 10, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
Table 4: Three Mackey-Glass Time Series, with Switching Period 100.

         L = 75, M = 15    L = 105, M = 15   L = 120, M = 15
  A      Tc       c        Tc       c        Tc       c
  0.00   1400   0.966      1000   0.968      1700   0.974
  0.05   1200   0.974      1200   0.940      2000   0.954
  0.10   2000   0.947      1700   0.966      2900   0.965
  0.15   1600   0.946      1000   0.918      2000   0.902
  0.20   1200   0.618      1500   0.676      2100   0.782

Notes: Time-series length, 4000 points. Segment length M = 15, fixed; block length L variable; noise range [−A/2, A/2], A variable. Tc is the time step by which all sources have been identified; c is classification accuracy.
noise levels and gradually drops off as the noise level increases.⁵ The scheme is fast: source identification and classification of the 4000 time steps (run concurrently) require a total of approximately 4 minutes. The algorithms are implemented in MATLAB and run on a 486 120-MHz IBM-compatible machine; a significant speedup would result from a C implementation (consider that MATLAB is an interpreted language). The same segmentation tasks have been treated by the method of annealed competition of experts (ACE) in Pawelzik et al. (1996). Regarding accuracy of segmentation, both methods appear to achieve roughly the same accuracy; however, the ACE method was applied to either noise-free or low-noise data; we do not know how well the ACE method performs in the presence of relatively high noise. Execution time is not cited in Pawelzik

⁵ To evaluate the effect of noise, keep in mind that the chaotic time series take values in the [0, 1] interval and the Mackey-Glass time series in the [0.7, 1.3] interval.
et al. (1996) for the two tasks considered here. On a third, prediction, task, the ACE method requires 2.5 hours on a SPARC 10/20 GX. In general, we expect the ACE method to be rather time-consuming, since it depends on annealing, which requires many passes through the training data. Regarding the required number of time-series observations, the ACE method uses 1200 data points for the first task and 400 data points for the second; several annealing passes are required. Our method requires a single pass through at most 3000 data points. We do not consider this an important difference. Because all time series involved are ergodic, we
Figure 4: A time-series segmentation experiment with three Mackey-Glass time series: L = 105, M = 15, A = 0.15. (A.) Evolution of credit function p_t^1. (B.) Evolution of credit function p_t^2. (C.) Evolution of credit function p_t^3. (D.) Classification decisions for the credit functions of Figures 4A-C. Note that all three sources have been identified at time Tc = 1000. The corresponding predictors are 1, 2, and 3. After Tc = 1000, most classifications are made to these three predictors or sources; however, some misclassifications to predictor 4 occur due to the presence of noise. Predictors 5 and 6 play practically no role in the classification process.
conjecture that processing 3000 data points once should yield more or less the same information as cycling 30 times over 100 data points.⁶

5 Conclusions

The unsupervised time-series segmentation scheme presented combines a source identification and a time-series classification algorithm. The source identification algorithm determines the number of sources generating the observations, distributes data between the sources, and trains one predictor for each source. The predictors are then used by the classification algorithm as follows. For every incoming data segment, each predictor computes a prediction; comparing these to the actual observation, segment prediction errors are computed for each predictor. The prediction errors are used to update a credit function recursively and competitively for each predictor or source. Sources that correspond to more successful predictors receive higher credit. Each segment is classified to the source that currently has the highest credit. The combined scheme consists of the identification and classification algorithms running concurrently, in a competitive and recursive manner. The scheme is accurate and fast and requires few training data, as has been established by numerical experimentation; hence it is suitable for online implementation. It is also modular: training and prediction are performed by independent modules, each of which can be removed from the system without affecting the properties of the remaining modules. This results in short training and development times. Finally, a convergence analysis of the classification algorithm has been published elsewhere (Kehagias & Petridis, 1997); this guarantees correct classification. Although we have not made any assumptions about the switching behavior of the source parameter, a Markovian assumption can be easily incorporated in the update equation 3.2. A convergence analysis for this case is also available. This modification affects the classification algorithm only, leaving the source identification algorithm unaffected. Due to lack of space, these developments will be presented in the future.

References

Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and nonlinear plants. In Advances in neural information processing systems 6 (NIPS 93), J. D. Cowen, G. Tesauro, & J. Alspector (Eds.), (pp. 719-726). San Mateo, CA: Morgan Kaufmann.
⁶ Using the same data for training and testing makes the task considered in Pawelzik et al. (1996) somewhat easier than the one considered in this article, where segmentation is evaluated on previously unseen data. But we believe that the difference is not great, due to the ergodic nature of the time series.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Haykin, S., & Cong, D. (1991). Classification of radar clutter using neural networks. IEEE Trans. on Neural Networks, 2, 589-600.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hilborn, C. G., & Lainiotis, D. G. (1969). Unsupervised learning minimum risk pattern classification for dependent hypotheses and dependent measurements. IEEE Trans. on Systems Science and Cybernetics, 5, 109-115.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.
Jordan, M. I., & Xu, L. (1995). Convergence results for the EM algorithm to mixtures of experts architectures. Neural Networks, 8, 1409-1431.
Kehagias, A., & Petridis, V. (1997). Predictive modular neural networks for time series classification. Neural Networks, 10, 31-49.
Kenny, P., Lennig, M., & Mermelstein, P. (1990). A linear predictive HMM for vector-valued observations with applications to speech recognition. IEEE Trans. on Acoustics, Speech and Signal Proc., 38, 220-225.
Kohonen, T. (1988). Self organization and associative memory. New York: Springer-Verlag.
Lainiotis, D. G. (1971). Optimal adaptive estimation: Structure and parameter adaptation. IEEE Trans. on Automatic Control, 16, 160-170.
Pawelzik, K., Kohlmorgen, J., & Muller, K. R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8, 340-356.
Petridis, V., & Kehagias, A. (1996a). A recurrent network implementation of time series classification. Neural Computation, 8, 357-372.
Petridis, V., & Kehagias, A. (1996b). Modular neural networks for MAP classification of time series and the partition algorithm. IEEE Trans. on Neural Networks, 7, 73-86.
Prank, K., Kloppstech, M., Nowlan, S. J., Sejnowski, T. J., & Brabant, G. (1996). Self-organized segmentation of time series: Separating growth hormone secretion in acromegaly from normal controls. Biophysical Journal, 70, 2540-2547.
Rabiner, L. R. (1988). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257-286.
Sims, F. L., Lainiotis, D. G., & Magill, D. T. (1969). Recursive algorithm for the calculation of the adaptive Kalman filter coefficients. IEEE Trans. on Automatic Control, 14, 215-218.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129-151.
Zetterberg, L. H., Patric, M., & Scott, D. (1981). Computer analysis of EEG signals with parametric models. Proc. IEEE, 69, 451-461.
Received August 20, 1996, accepted January 29, 1997.
Communicated by Peter Dayan
Adaptive Mixtures of Probabilistic Transducers Yoram Singer AT&T Labs, Florham Park, NJ 07932, U.S.A.
We describe and analyze a mixture model for supervised learning of probabilistic transducers. We devise an online learning algorithm that efficiently infers the structure and estimates the parameters of each probabilistic transducer in the mixture. Theoretical analysis and comparative simulations indicate that the learning algorithm tracks the best transducer from an arbitrarily large (possibly infinite) pool of models. We also present an application of the model for inducing a noun phrase recognizer.

1 Introduction

Supervised learning of probabilistic mappings between temporal sequences is an important goal of natural data analysis and classification with a broad range of applications, including handwriting and speech recognition, natural language processing, and biological sequence analysis. Research efforts in supervised learning of probabilistic mappings have focused almost exclusively on estimating the parameters of a predefined model. For example, Giles et al. (1992) used a second-order recurrent neural network to induce a finite-state automaton that classifies input sequences, and Bengio and Frasconi (1994) introduced an input-output hidden Markov model (HMM) architecture that was used for similar tasks. In this article, we introduce and analyze an alternative approach based on a mixture model of a new subclass of probabilistic transducers, which we call suffix tree transducers. Mixture models, often referred to as mixtures of experts, are a powerful approach both theoretically and experimentally. See DeSantis, Markowski, and Wegman (1988); Jacobs, Jordan, Nowlan, and Hinton (1991); Haussler and Barron (1993); Littlestone and Warmuth (1994); Cesa-Bianchi et al. (1993); and Helmbold and Schapire (1995) for analyses and applications of mixture models, from different perspectives, such as connectionism, Bayesian inference, and computational learning theory. A key advantage of the mixture of experts architecture is the ability to attack problems by dividing them into simpler subproblems, whose solutions can be found efficiently and then combined to yield a solution to the original problem. Thus, the common approach used in learning complex mixture models is based on the principle of divide and conquer, which usually results in a greedy search for a good model (Breiman, Friedman, Olshen, & Stone,
1984; Quinlan, 1993) or in a hill-climbing algorithm that (locally) maximizes a utility function (Jacobs et al., 1991; Jordan & Jacobs, 1994). In this article we consider a restricted class of models, which enables us to devise an alternative learning approach. The alternative suggested is to calculate the evidence associated with each possible model in the mixture efficiently rather than searching for a single model. By combining techniques used for compression and unsupervised learning (Willems, Shtarkov, & Tjalkens, 1995; Ron, Singer, & Tishby, 1996), we obtain an online algorithm that efficiently estimates the weights of all the possible models from an arbitrarily large (possibly infinite) pool of suffix tree transducers.

The article starts by defining the class of models used in it for learning a mapping from input sequences to output sequences: suffix tree transducers. Informally, suffix tree transducers are prediction trees where the features used to determine the predicting node are suffixes of the input sequence. We then describe a mixture model of suffix tree transducers. The specific mixture model considered exploits the tree structure of the transducers by defining a hierarchy of trees where each tree in the mixture is a refinement of a smaller tree in the hierarchy. Although there might be a huge number of tree-based transducers in the pool of models, we show that by using a recursive prior, we can efficiently calculate the prediction of the entire mixture in time linear in the depth of the largest tree in the mixture. This prior can be viewed as a creation process where at a given node in the suffix tree, we choose either to step successively down the tree and create more nodes or stop and use the current node for predicting the outcome (the next symbol of the output sequence). Based on the recursive prior, we describe how to maintain, in a computationally efficient manner, posterior probabilities of the same recursive structure as the prior probability distribution. Furthermore, we show that this construction of posterior probabilities is not only efficient but also robust in the sense that the predictions of the mixture of suffix trees are almost as good as the predictions of the best model in the ensemble of models considered. Specifically, we prove that the extra loss incurred for the mixture depends only on the choice of the prior and does not grow with the number of examples. We also look at the problem of estimating the output probability functions at the nodes. We present two online parameter estimation schemes: a simple smoothing technique for small alphabets and a more sophisticated mixture-based parameter estimation technique, which is suited for large alphabets. Both schemes are adaptive, enabling us to output predictions while estimating the parameters of all the transducers in the mixture. We show for both schemes that the extra loss incurred by the online estimation (and prediction) algorithm grows sublinearly in the number of examples. Combining the mixture-weighing technique for all possible suffix trees with the mixture-based parameter estimation scheme yields a flexible and powerful learning algorithm. Put another way, no hard decisions are made by the learning algorithm, not even for the set of relevant parameters of each model. We conclude with simulations and experiments using natural data. The experimental results support the formal analysis and indicate that the learning algorithm is indeed able to track the best model in an arbitrarily large pool of models, yielding an accurate approximation of the source.

2 Mixtures of Suffix Tree Transducers

Let Σin and Σout be two finite alphabets. A suffix tree transducer T over (Σin, Σout) is a |Σin|-ary tree where every internal node of T has one child for each symbol in Σin. The nodes of the tree are associated with pairs (s, γ_s), where s is the string associated with the path (sequence of symbols in Σin) that leads from the root to that node, and γ_s : Σout → [0, 1] is an output probability function. A suffix tree transducer maps arbitrarily long input sequences over Σin to output sequences over Σout as follows. The probability that a suffix tree transducer T will output a symbol y ∈ Σout at time step n, denoted by P(y | x_1, . . . , x_n, T), is γ_{s_n}(y), where s_n is the string labeling the leaf reached by taking the path corresponding to x_n x_{n−1} x_{n−2} . . . starting at the root of T. We assume without loss of generality that the input sequence x_1, . . . , x_n was padded with a sufficiently long sequence . . . x_{−2} x_{−1} x_0 so that the leaf corresponding to s_n is well defined for any time step n ≥ 1. The probability that T will output a string y_1, . . . , y_n ∈ Σout^n given an input string x_1, . . . , x_n ∈ Σin^n, denoted by P(y_1, . . . , y_n | x_1, . . . , x_n, T), is therefore ∏_{k=1}^n γ_{s_k}(y_k). Note that only the leaves of T are used for prediction. A suffix tree transducer is thus a probabilistic mapping that induces a measure over the possible output strings given an input string. A submodel T′ of a suffix tree transducer T is obtained by removing some of the internal nodes and leaves of T and using the leaves of the pruned tree T′ for prediction. An example of a suffix tree transducer and two of its possible submodels is given in Figure 1. In this article we consider transducers based on complete suffix trees. That is, each node in a given suffix tree is either a leaf or all of its children are also in the tree. For a suffix tree transducer T and a node s ∈ T, we use the following notations:

• The set of all possible complete subtrees of T (including T itself) is denoted by Sub(T).
• The set of leaves of T is denoted by Leaves(T).
• The set of subtrees T′ ∈ Sub(T) such that s ∈ Leaves(T′) is denoted by L_s(T).
• The set of subtrees T′ ∈ Sub(T) such that s ∈ T′ and s ∉ Leaves(T′) is denoted by I_s(T).
the learning algorithm, not even for the set of relevant parameters of each model. We conclude with simulations and experiments using natural data. The experimental results support the formal analysis and indicate that the learning algorithm is indeed able to track the best model in an arbitrarily large pool of models, yielding an accurate approximation of the source. 2 Mixtures of Suffix Tree Transducers Let 6in and 6out be two finite alphabets. A suffix tree transducer T over (6in , 6out ) is a |6in |-ary tree where every internal node of T has one child for each symbol in 6in . The nodes of the tree are associated with pairs (s, γs ), where s is the string associated with the path (sequence of symbols in 6in ) that leads from the root to that node, and γs : 6out → [0, 1] is an output probability function. A suffix tree transducer maps arbitrarily long input sequences over 6in to output sequences over 6out as follows. The probability that a suffix tree transducer T will output a symbol y ∈ 6out at time step n, denoted by P(y | x1 , . . . , xn , T), is γsn (y), where sn is the string labeling the leaf reached by taking the path corresponding to xn xn−1 xn−2 . . . starting at the root of T. We assume without loss of generality that the input sequence x1 , . . . , xn was padded with a sufficiently long sequence . . . x−2 x−1 x0 so that the leaf corresponding to sn is well defined for any time step n ≥ 1. The n given an input probability that T will output a string y1 , . . . , yn in 6out n , . . . , x in 6 , denoted by P(y , . . . , y | x , . . . , x , T), is therefore string x 1 n 1 n 1 n in Qn k=1 γsk (yk ). Note that only the leaves of T are used for prediction. A suffix tree transducer is thus a probabilistic mapping that induces a measure over the possible output strings given an input string. A submodel T0 of a suffix tree transducer T is obtained by removing some of the internal nodes and leaves of T and using the leaves of the pruned tree T0 for prediction. An example of a suffix tree transducer and two of its possible submodels is given in Figure 1. In this article we consider transducers based on complete suffix trees. That is, each node in a given suffix tree is either a leaf or all of its children are also in the tree. For a suffix tree transducer T and a node s ∈ T, we use the following notations: • The set of all possible complete subtrees of T (including T itself) is denoted by Sub(T). • The set of leaves of T is denoted by Leaves(T). • The set of subtrees T0 ∈ Sub(T) such that s ∈ Leaves(T0 ) is denoted by Ls (T). • The set of subtrees T0 ∈ Sub(T) such that s ∈ T0 and s 6∈ Leaves(T0 ) is denoted by Is (T). • The set of subtrees T0 ∈ Sub(T) such that s ∈ T0 is denoted by As (T).
[Figure 1 appears here: a binary suffix tree with nodes ε, 0, 1, 00, 01, 10, 11, 010, and 110, each annotated with an output probability function over the symbols a, b, c.]

Figure 1: A suffix tree transducer over (Σin, Σout) = ({0, 1}, {a, b, c}) and two of its possible submodels (subtrees). The strings labeling the nodes are the suffixes of the input string used to predict the output string. At each node, there is an output probability function defined for each of the possible output symbols. For instance, using the full suffix tree transducer, the probability of observing the symbol b given that the input sequence is . . . 010, is 0.1. The probability of the current output, when each transducer is associated with a weight (prior), is the weighted sum of the predictions of each transducer. For example, assume that the weights of the trees are 0.7 (full tree), 0.2 (large subtree), and 0.1 (small subtree); then the probability that the output y_n = a given that (x_{n−2}, x_{n−1}, x_n) = (0, 1, 0) is 0.7 · P(a | 010, T_1) + 0.2 · P(a | 10, T_2) + 0.1 · P(a | 0, T_3) = 0.7 · 0.8 + 0.2 · 0.7 + 0.1 · 0.5 = 0.75.
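As a minimal illustration of how such a transducer produces a prediction, consider the following Python sketch (the dict-based representation keyed by suffix strings is our own choice, not the article's):

    def predict(tree, x_hist, y):
        """P(y | input history) for a suffix tree transducer.

        tree   : dict mapping each node's suffix string to its output
                 probability function (a dict: symbol -> probability);
                 the root is the empty string ""
        x_hist : input symbols x_1 .. x_n, most recent last
        """
        s = ""
        for x in reversed(x_hist):  # walk x_n, x_{n-1}, ... from the root
            if x + s not in tree:
                break               # s has no child labeled x, so s is a leaf
            s = x + s
        return tree[s][y]           # only leaves are used for prediction

For the full transducer of Figure 1 and input . . . 010, the walk stops at the leaf labeled 010, giving P(b | . . . 010) = tree["010"]["b"] = 0.1, as in the caption.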
Whenever it is clear from the context, we will omit the dependency on T and denote the above sets as Ls , Is , and As . Note that the above definitions imply that As = Ls ∪ Is . For a given suffix tree transducer T we are interested in the mixture of all possible submodels (subtrees) of T. We associate with each subtree (including T itself) a weight that can be interpreted as its prior probability. We later show how the learning algorithm for a mixture of suffix tree transducers adapts these weights in accordance with the performance (evidence, in Bayesian terms) of each subtree on past observations. Direct calculation of the mixture probability is infeasible since there might be an enormous
number of such subtrees. However, as shown subsequently, the technique introduced in Willems et al. (1995) can be generalized and applied to our setting. Let T′ be a subtree of T. Denote by n_1 the number of internal nodes of T′ and by n_2 the number of leaves of T′ that are not leaves of T. For example, n_1 = 2 and n_2 = 1 for the small subtree of the suffix tree transducer depicted in Figure 1. We define the prior weight of a tree T′, denoted by P_0(T′), to be (1 − α)^{n_1} α^{n_2}, where α ∈ (0, 1). It can be easily verified by induction on the number of nodes of T that this definition of the weights is a proper measure, that is, ∑_{T′∈Sub(T)} P_0(T′) = 1. This distribution over suffix trees can be extended to trees of unbounded depth assuming that T is an infinite |Σin|-ary suffix tree transducer. The prior weights can be interpreted as if they were created by the following recursive probabilistic process. We start with a suffix tree that includes only the root node. With probability α, we stop the process, and with probability 1 − α, we add all the possible |Σin| children of the node and continue the process recursively for each child. For a prior probability distribution over suffix trees of unbounded depth, the process ends when no new children were created. For bounded-depth trees, the recursive process stops if either no new children were created at a node or if the node is a leaf of the maximal suffix tree T. This notion of recursive equations for hierarchical mixtures of experts architecture has been used before (Jordan & Jacobs, 1994). In the context of suffix tree transducers, such a recursive formulation has several merits. First, as we show in the next section, it enables us to calculate the prediction of a mixture that includes a huge number of models, which cannot be computed directly by calculating the contribution of each individual model in the mixture. Second, the self-similar structure of the prior enables the pool of models to grow on-the-fly as more data arrive, such that the current mixture is a refinement of its predecessor. Last, the recursive definition of the prior probability distribution implies that larger (and more complex) suffix trees are assigned a smaller prior probability. Thus, the influence of large suffix trees, which might overfit the data, is contrasted with their initial small prior probability, and the danger that the entire mixture of suffix trees would overfit the data is relatively small (if not negligible). Each node in the pool of suffix trees has two roles: it can be either a leaf and used for prediction or an internal node that simply resides on a path to one of its possible children nodes. Thus, the weight α can be interpreted as the prior probability that the node is a leaf of any transducer T′ ∈ Sub(T), that is,

α = ∑_{T′∈L_s} P_0(T′) / ∑_{T′∈A_s} P_0(T′).   (2.1)

Therefore, for a bounded-depth transducer T, if s ∈ Leaves(T), then
A_s = L_s ⇒ α = 1, and the recursive process stops with probability 1. In fact, we can associate a different prior probability α_s with each node s ∈ T. For the sake of brevity, we omit the dependency on the node s and assume that all the priors for internal nodes are equal, that is, ∀s ∉ Leaves(T): α_s = α. Using the recursive prior over suffix tree transducers, we can calculate the prediction of the mixture of all T′ ∈ Sub(T) on the first output symbol y = y_1 as follows,

αγ_ε(y) + (1 − α)(αγ_{x_1}(y) + (1 − α)(αγ_{x_0 x_1}(y) + (1 − α)(αγ_{x_{−1} x_0 x_1}(y) + (1 − α) . . .))).   (2.2)

The calculation ends when a leaf of T is reached or when the beginning of the input sequence is encountered. Therefore, the prediction time of a single symbol is bounded by the maximal depth of T, or the length of the input sequence if T is infinite. Let us assume from now on that T is finite. (A similar derivation holds when T is infinite.) Let γ̃_s(y) be the prediction propagating from a node s labeled by x_l, . . . , x_1 (we give a precise definition of γ̃_s in the next section), that is,

γ̃_s(y) = γ̃_{x_l,...,x_1}(y) = αγ_{x_l,...,x_1}(y) + (1 − α)(αγ_{x_{l−1},...,x_1}(y) + (1 − α)(αγ_{x_{l−2},...,x_1}(y) + (1 − α)(. . . + γ_{x_j,...,x_1}(y)) . . .))),   (2.3)

where x_j, . . . , x_1 is the label of a leaf in T. Let σ = x_{1−|s|}; then the above sum can be evaluated recursively as follows:

γ̃_s(y) = { γ_s(y)                              if s ∈ Leaves(T),
          { α γ_s(y) + (1 − α) γ̃_{σs}(y)       otherwise.   (2.4)

The prediction of the whole mixture of suffix tree transducers on y is the prediction propagated from the root node ε, namely, γ̃_ε(y). For example, for the input sequence . . . 0110, output symbol y = b, and α = 1/2, the predictions propagated from the nodes of the full suffix tree transducer from Figure 1 are γ̃_110(b) = 0.4, γ̃_10(b) = 0.5 γ_10(b) + 0.5 · 0.4 = 0.3, γ̃_0(b) = 0.5 γ_0(b) + 0.5 · 0.3 = 0.25, γ̃_ε(b) = 0.5 γ_ε(b) + 0.5 · 0.25 = 0.25.

3 An Online Learning Algorithm

We now describe an efficient learning algorithm for the mixture of suffix tree transducers. The learning algorithm uses the recursive prior and the evidence to update efficiently the posterior probability of each possible
transducer in the mixture. In this section, we assume that the output probability functions at the nodes are known. Hence, at each time step, we need to evaluate the following,

P(y_n | x_1, . . . , x_n) = ∑_{T′∈Sub(T)} P(y_n | x_1, . . . , x_n, T′) P_{n−1}(T′),   (3.1)
where P_n(T′) is the posterior weight of the suffix tree transducer T′ after n input-output pairs, (x_1, y_1), . . . , (x_n, y_n). Omitting the dependency on the input sequence for the sake of brevity, P_n(T′) equals

P_n(T′) = P_0(T′) ∏_{i=1}^n P(y_i | T′) / [ ∑_{T″∈Sub(T)} P_0(T″) ∏_{i=1}^n P(y_i | T″) ]
        = P_0(T′) P(y_1, . . . , y_n | T′) / [ ∑_{T″∈Sub(T)} P_0(T″) P(y_1, . . . , y_n | T″) ].   (3.2)
Direct evaluation of the above is again infeasible due to the fact that there is an enormous number of submodels of T. However, using the technique of recursive evaluation as in equation 2.4, we can efficiently calculate the prediction of the mixture. Let s be a node in T that is reached at time step n (s = x_{n−|s|+1}, . . . , x_n). We denote by 1 ≤ i_1 < i_2 < · · · < i_l = n the time steps at which the node s was reached when observing a given sequence of n input-output pairs. Similar to the definition of the recursive prior α, we define q_n(s) to be the posterior probability (on the subsequences reaching s) that the node s is a leaf. Analogously to the interpretation of the prior as a ratio of probabilities of suffix trees, q_n(s) is the ratio between the weighted predictions of all subtrees for which the node s is a leaf and the weighted predictions of all subtrees that include s. That is,

q_n(s) = ∑_{T′∈L_s} P_0(T′) P(y_{i_1}, . . . , y_{i_l} | T′) / ∑_{T′∈A_s} P_0(T′) P(y_{i_1}, . . . , y_{i_l} | T′).   (3.3)

We can now compute the predictions of the nodes along the path defined by x_n x_{n−1} x_{n−2} . . . simply by replacing the prior weight α with the posterior weight q_{n−1}(s),

γ̃_s(y_n) = { γ_s(y_n)                                                  if s ∈ Leaves(T),
           { q_{n−1}(s) γ_s(y_n) + (1 − q_{n−1}(s)) γ̃_{σs}(y_n)        otherwise,   (3.4)

where σ = x_{n−|s|}. We show in theorem 1 that γ̃_s is the following weighted prediction of all the suffix tree transducers that include the node s,

γ̃_s(y_n) = ∑_{T′∈A_s} P_0(T′) P(y_{i_1} . . . y_{i_l} | T′) / ∑_{T′∈A_s} P_0(T′) P(y_{i_1} . . . y_{i_{l−1}} | T′).   (3.5)
In order to update q_n(s) we introduce one more variable, which we denote by r_n(s). Setting r_0(s) = log(α/(1 − α)) for all s, r_n(s) is updated as follows:

r_n(s) = r_{n−1}(s) + log(γ_s(y_n)) − log(γ̃_{σs}(y_n)).   (3.6)
Based on the definition of q_n(s), r_n(s) is the log-likelihood ratio between the weighted predictions of the subtrees for which s is a leaf and the weighted predictions of the subtrees for which s is an internal node. The new posterior weights q_n(s) are calculated from r_n(s) using a sigmoid function,

q_n(s) = 1/(1 + e^{−r_n(s)}).   (3.7)
For a node s′ that is not reached at time step n, we simply define r_n(s′) = r_{n−1}(s′) and q_n(s′) = q_{n−1}(s′). In summary, for each new observation pair, we traverse the tree by following the path that corresponds to the input sequence x_n x_{n−1} x_{n−2} . . . The predictions at the nodes along the path are calculated using equation 3.4. Given these predictions, the posterior probabilities of being a leaf are updated for each node s along the path using equations 3.6 and 3.7. Finally, the probability of y_n induced by the whole mixture is the probability propagated from the root node, as stated by the following theorem. (The proof of the theorem is given in the appendix.)

Theorem 1.
γ̃_ε(y_n) = ∑_{T′∈Sub(T)} P(y_n | T′) P_{n−1}(T′).
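The whole procedure fits in a few lines. Below is a hedged Python sketch of one time step (equations 3.4, 3.6, and 3.7); the dict-based tree representation and the lazy default for unvisited nodes are our own choices, not the article's:

    import math

    def mixture_step(gamma, r, x_hist, y, alpha=0.5):
        """Return the mixture probability of y_n (Theorem 1); update r in place.

        gamma  : dict suffix -> output function (dict symbol -> prob);
                 "" is the root epsilon
        r      : dict suffix -> log-odds r(s); missing entries mean r_0(s)
        x_hist : input symbols x_1 .. x_n, most recent last
        """
        r0 = math.log(alpha / (1.0 - alpha))
        # Nodes visited at time n: the root, x_n, x_{n-1}x_n, ..., to a leaf.
        path, s = [""], ""
        for x in reversed(x_hist):
            if x + s not in gamma:
                break
            s = x + s
            path.append(s)
        # Bottom-up predictions, equation 3.4 (the deepest node acts as a leaf).
        g = [0.0] * len(path)
        g[-1] = gamma[path[-1]][y]
        for i in range(len(path) - 2, -1, -1):
            q = 1.0 / (1.0 + math.exp(-r.get(path[i], r0)))       # equation 3.7
            g[i] = q * gamma[path[i]][y] + (1.0 - q) * g[i + 1]
        # Posterior update, equation 3.6, for internal nodes on the path.
        for i in range(len(path) - 1):
            r[path[i]] = (r.get(path[i], r0)
                          + math.log(gamma[path[i]][y]) - math.log(g[i + 1]))
        return g[0]

With the transducer of Figure 1, fresh log-odds, α = 1/2, and an input history ending in 0, 1, 1, 0, the call returns 0.25 for y = b, matching the worked example above.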
We now discuss the performance of a mixture of suffix tree transducers. Let Loss_n(T) be the negative log-likelihood of a suffix tree transducer T achieved on n input-output pairs,

Loss_n(T) = − ∑_{i=1}^n log2(P(y_i | T)).   (3.8)

Similarly, the loss of the mixture is defined to be

Loss_n^mix = − ∑_{i=1}^n log2( ∑_{T′∈Sub(T)} P(y_i | T′) P_{i−1}(T′) ) = − ∑_{i=1}^n log2(γ̃_ε(y_i)).   (3.9)
The advantage of using a mixture of suffix tree transducers over a single suffix tree is due to the robustness of the solution, in the sense that the predictions of the mixture are almost as good as the predictions of the best suffix tree in the mixture, as shown in the following theorem.
Theorem 2. Let T be a (possibly infinite) suffix tree transducer, and let (x_1, y_1), . . . , (x_n, y_n) be any possible sequence of input-output pairs. The loss of the mixture is at most

min_{T′∈Sub(T)} { Loss_n(T′) − log2(P_0(T′)) }.
and let K = |6out |. A commonly used estimator approximates the output probability function at time step n + 1 by adding a constant δ to each count cn (y), γ (y) ≈ γˆ n (y) = P
cn (y) + δ cn (y) + δ = . 0 n+δK y0 ∈6out cn (y ) + δ K
(4.1)
The estimator that achieves the best likelihood on a given sequence of length n, called the maximum likelihood estimator (MLE), is attained for δ = 0: γML (y) =
cn (y) . n
(4.2)
The MLE is evaluated in batch, after the entire sequence is observed, while the online estimator defined by equation 4.1 can be used on the fly as new symbols arrive. The special case of δ = 12 is called Laplace’s modified rule
1720
Yoram Singer
of succession or the add 12 estimator. Krichevsky and Trofimov (1981) proved that the likelihood attained by the add 12 estimator is almost as good as the likelihood achieved by the maximum likelihood estimator. Applying the bound of Krichevsky and Trofimov in our setting results in the following bound on the predictions of the add 12 estimator, −
n X
log2 (γˆ i−1 (yi )) = −
i=1
≤−
n X
µ log2
i=1 n X i=1
ci−1 (yi ) + 1/2 i − 1 + K/2
µ log (cn (yi )/n) +(K − 1) } | 2 {z
¶
¶ 1 log2 (n) + 1 . 2
(4.3)
=log2 (γML (yi ))
For a mixture of suffix tree transducers, we use the add 12 estimator at each node separately. To do so, we simply need to store counts of the number of appearances of each symbol at each node of the maximal suffix tree transducer T. Let T0 ∈ Sub(T) be any suffix tree transducer from the pool of possible submodels whose parameters are set so that it achieves the maximum likelihood on (x1 , y1 ), . . . , (xn , yn ), and let Tˆ 0 be the same suffix tree transducer but with parameters adaptively estimated using the add 12 estimator. Denote by ns the number of times a node s ∈ Leaves(T0 ) was reached on the input-output sequence, and let l(T0 ) = |Leaves(T0 )| be the total number of leaves of T0 . From theorem 1 we get, ˆ0 ˆ0 Lossmix n ≤ Lossn (T ) − log2 (P0 (T )).
(4.4)
From the bound of equation 4.3 on the add 12 estimator we get, X ¡ ¢ ˆ 0 ) ≤ Lossn (T0 ) + Lossn (T 1/2(K − 1) log2 (ns ) + K − 1 s∈Leaves(T0 )
X 1 l(T0 ) (K − 1) log2 (ns ) 0 2 s l(T ) ¶ µP l(T0 ) s ns 0 0 (K − 1) log2 ≤ Lossn (T ) + l(T )(K − 1) + 2 l(T0 ) ¶ ¶ µ µ 1 n log2 +1 , (4.5) = Lossn (T0 ) + l(T0 ) (K − 1) 2 l(T0 )
= Lossn (T0 ) + l(T0 )(K − 1) +
where we used the concavity of the logarithm function to obtain the last inequality. Finally, combining equations 4.4 and 4.5 we get, 0 0 Lossmix n ≤ Lossn (T ) − log2 (P0 (T )) ¶ ¶ µ µ n 1 0 log2 +1 . + l(T ) (K − 1) 2 l(T0 )
(4.6)
Adaptive Mixtures of Probabilistic Transducers
1721
The above bound holds for any T0 ∈ Sub(T); thus we obtain the following corollary: Corollary 3. Let T be a (possibly infinite) suffix tree transducer, and let (x1 , y1 ), . . . , (xn , yn ) be any possible sequence of input-output pairs. The loss of the mixture is at most ½ min
T0 ∈Sub(T)
Lossn (T0 ) − log2 (P0 (T0 )) µ
1 log2 + l(T )(K − 1) 2 0
µ
n l(T0 )
¶
¶¾ +1
,
where the parameters of each T0 ∈ Sub(T) are set so as to maximize the likelihood of T0 on the entire input-output sequence. The running time of the algorithm is K·D·n, where D is the maximal depth of T or K2 (n + 1)2 when T is of an unbounded depth.
Note that the above corollary holds for any n. Therefore, the mixture model tracks the best transducer, although that transducer and its parameters change in time. The additional cost we need to pay for not knowing ahead of time the transducer’s structure and parameters grows sublinearly with the length of the input-output sequence like O(log(n)). Normalizing this additional cost term by the length of the sequence, we get that the predictions of the adaptively estimated mixture lag behind the predictions of the best transducer by an additive factor, which behaves like C log(n)/n→0, where the constant C depends on the total number of parameters of the best transducer. When K is large, the smoothing of the output probabilities using the add 12 estimator is too crude. Furthermore, in many real problems, only a small subset of the full output alphabet is observed in a given context (a node in the tree). For example, when mapping phonemes to phones (Riley, 1991), for a given sequence of input phonemes, the possible phones that can be pronounced are limited to a few possibilities (typically, two to four). Therefore, we would like to devise an estimation scheme whose performance depends on the actual local (context-dependent) alphabet, not on the full alphabet. Such an estimation scheme can be devised by employing a mix0 of 6 . Although ture of models—one model for each possible subset 6out out K there are 2 possible subsets of 6out , we now show that if the estimation technique depends only on the size of each subset, then the whole mixture can be maintained in time linear in K. As before, we omit the dependency on the node and treat the subsequence 0 ) the that reaches a node as an individual sequence. Denote by γˆ n (y | 6out estimate of γ (y) after n observations assuming that the actual alphabet is
1722
Yoram Singer
0 . Using the add 1 estimator, we get 6out 2 0 0 γˆ n (y | 6out ) = γˆ n (y | |6out | = i) = (cn (y) + 1/2)/(n + i/2).
(4.7)
0 | = i) as γˆ n (y | i). For the sake of brevity we simply denote γˆ n (y | |6out n Let 6out be the set of different output symbols that were observed at a node, that is,
© ª n def = σ | σ ∈ 6out , σ = yi , 1 ≤ i ≤ n . 6out def
0 n to be the empty set and let Kn = |6out |. Since each possible We define 6out 0 of size i = |6 0 | should contain all symbols in 6 n , there are alphabet 6out out out ¡ ¢ n be the posterior probability n exactly K−K possible alphabets of size i. Let w i i−Kn (after n observations) of any alphabet of size i. The prediction of the mixture 0 such that 6 n ⊆ 6 0 ⊆ 6 of all possible subsets 6out out is out out
γˆ n (y) =
¶ K µ X K − Kn j=Kn
j − Kn
wjn γˆ n (y | j).
(4.8)
Evaluation of this sum requires O(K) operations (and not O(2K )). Furthermore, we can compute equation 4.8 in an online fashion as follows. Let Win be the (prior weighted) likelihood of all alphabets of size i: def
Win =
X 0 |6out |=i
w0i
n Y
0 γˆ k−1 (yk | 6out )
k=1
µ ¶ n Y K − Kn = γˆ k−1 (yk | i). w0i i − Kn k=1
(4.9)
Without loss of generality, let us assume that all alphabets are equally likely ¡ ¢ 0 ) = 1/2K . Therefore, W 0 = K /2K .1 The weights a priori, that is, P0 (6out i i Win can be updated incrementally depending on the following cases. If the n | > i), number of different symbols observed so far exceeds a given size (|6out then all alphabets of this size are too small and should be eliminated from the mixture by setting their posterior probability to zero. Otherwise, if the n ), the output probability is next symbol was observed before (yn+1 ∈ 6out 1 approximated by the add 2 estimator. Last, if the next symbol is entirely new 1 One can easily define and use nonuniform priors. For instance, the assumption that all alphabet sizes are equally likely a priori and all alphabets of the same size have the same prior probability results in a prior that favors small and large alphabets. Such a prior is used in the example in Figure 2.
Adaptive Mixtures of Probabilistic Transducers
1723
n+1 n (yn+1 6∈ 6out ), we need to rescale Win for i ≥ |6out | for the following reason. n+1 n n Since yn+1 6∈ 6out , then 6out = 6out ∪ {yn+1 } ⇒ Kn+1 = Kn + 1 and
µ
K − Kn+1 i − Kn+1
¶ µ µ ¶ K − Kn i − Kn K − Kn − 1 = . i − Kn K − Kn i − Kn − 1
¶ =
Hence, the number of possible alphabets of size i has decreased by a factor of (i − Kn )/(K − Kn ) and each weight Win+1 (i ≥ Kn+1 ) has to be multiplied by this factor in addition to the prediction of the add 12 estimator on yn+1 . The above three cases can be summarized as an update rule of Win+1 from Win as follows,
Win+1 = Win ×
0
if Kn+1 > i
cn (yn+1 )+1/2 n+i/2 i−Kn 1/2 K−Kn n+i/2
n if Kn+1 ≤ i and yn+1 ∈ 6out . (4.10) n if Kn+1 ≤ i and yn+1 6∈ 6out
Applying Bayes rule for the mixture of all possible subsets of the output alphabet, we get

$$\hat{\gamma}^n(y_{n+1}) = \sum_{i=1}^{K} W^{n+1}_i \Big/ \sum_{i=1}^{K} W^n_i = \sum_{i=K_n}^{K} W^{n+1}_i \Big/ \sum_{i=K_n}^{K} W^n_i. \tag{4.11}$$
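To make the bookkeeping concrete, here is a minimal Python sketch (our own illustration, not the paper's code; the class name and method layout are assumptions) of the $O(K)$-time mixture over alphabet sizes defined by equations 4.7 through 4.11:

```python
from math import comb

class AlphabetSizeMixture:
    """Online mixture of add-1/2 estimators, one per alphabet size,
    maintained in O(K) per symbol (sketch of equations 4.7-4.11)."""

    def __init__(self, K):
        self.K = K
        # Uniform prior over subsets: W_i^0 = C(K, i) / 2^K; the empty
        # alphabet (i = 0) can never explain an observation, so drop it.
        self.W = [0.0] + [comb(K, i) / 2 ** K for i in range(1, K + 1)]
        self.counts = {}  # c_n(y): occurrences of each observed symbol
        self.n = 0        # number of observations so far

    def _factor(self, i, y):
        # Per-size multiplicative factor of equation 4.10; it equals the
        # total prediction of all alphabets of size i on the next symbol y.
        Kn = len(self.counts)
        new = y not in self.counts
        if i == 0 or i < (Kn + 1 if new else Kn):
            return 0.0  # size-i alphabets cannot contain all seen symbols
        if new:
            return ((i - Kn) / (self.K - Kn)) * (0.5 / (self.n + i / 2))
        return (self.counts[y] + 0.5) / (self.n + i / 2)  # add-1/2, eq. 4.7

    def predict(self, y):
        # Equation 4.11: sum_i W_i^{n+1} / sum_i W_i^n.
        num = sum(w * self._factor(i, y) for i, w in enumerate(self.W))
        return num / sum(self.W)

    def update(self, y):
        # Equation 4.10, then record the observation.
        for i in range(self.K + 1):
            self.W[i] *= self._factor(i, y)
        self.counts[y] = self.counts.get(y, 0) + 1
        self.n += 1
```

Each symbol costs one pass over the $K+1$ size weights, which is where the linear-in-$K$ bound in the text comes from.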
Let $\widetilde{W}^n_i$ be the posterior probability of all possible alphabets of size $i$, that is, $\widetilde{W}^n_i \stackrel{\mathrm{def}}{=} W^n_i / \sum_j W^n_j$. Let $P^n_i(y_{n+1}) = W^{n+1}_i / W^n_i$ be the sum of the predictions of all possible alphabets of size $i$ on $y_{n+1}$. Thus, the prediction of the mixture $\hat{\gamma}^n(y_{n+1})$ can be rewritten as

$$\hat{\gamma}^n(y_{n+1}) = \sum_{i=1}^{K} W^{n+1}_i \Big/ \sum_{i=1}^{K} W^n_i = \sum_{i=1}^{K} \widetilde{W}^n_i\, P^n_i(y_{n+1}).$$
In Figure 2 we give an illustration of the values obtained for $\widetilde{W}^n_i$ in the following experiment. We simulated a multinomial source whose output symbols belong to an alphabet of size $|\Sigma_{out}| = 10$ and set the probabilities of observing any of the last five symbols to zero. Therefore, the actual alphabet is of size 5. The posterior probabilities $\widetilde{W}^n_i$ ($1 \le i \le 10$) were calculated after each iteration. These probabilities are plotted left to right and top to bottom. The first observations rule out alphabets of size smaller than 5, setting their posterior probability to zero. After a few observations, the posterior probability is concentrated around the actual size, yielding an accurate and adaptive estimate of the actual alphabet size of the source.
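The same behavior can be mimicked with the `AlphabetSizeMixture` sketch above; symbol identities and the random seed below are arbitrary choices of ours:

```python
# Reproduce the spirit of the Figure 2 experiment: 10 possible symbols,
# but only 5 ever occur, so the posterior should concentrate near size 5.
import random

random.seed(0)
mix = AlphabetSizeMixture(K=10)
for step in range(1, 101):
    y = random.randrange(5)  # symbols 5..9 never occur
    mix.update(y)
    if step in (5, 20, 100):
        total = sum(mix.W)
        print(step, [round(w / total, 2) for w in mix.W])
```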
Figure 2: An illustration of adaptive estimation of the alphabet size for a multinomial source with a large number of possible outcomes when only a subset of the full alphabet is actually observed. In this illustration, examples were randomly drawn from an alphabet of size 10, but only five symbols have a nonzero probability. The posterior probability concentrates on the right size after fewer than 20 iterations. The initial prior used in this example assigns an equal probability to the different alphabet sizes.
Let $\gamma_{ML}$ denote again the MLE of $\gamma$ evaluated after $n$ input-output observation pairs:

$$\gamma_{ML}(y) = \begin{cases} c_n(y)/n & y \in \Sigma^n_{out} \\ 0 & y \notin \Sigma^n_{out} \end{cases}. \tag{4.12}$$
We can now apply the same technique used to bound the loss of a mixture of transducers to bound the loss of a mixture of all possible subsets of the output alphabet. First, we bound the likelihood of the mixture of add-1/2 estimators based on all possible subsets of $\Sigma_{out}$. We do so by comparing the mixture's likelihood to the likelihood achieved by a single add-1/2 estimator that uses the "right" size (i.e., the estimator that uses the final alphabet, $\Sigma^n_{out}$) for prediction:

$$-\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i)) \le -\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i \mid \Sigma^n_{out})) - \log_2(P_0(\Sigma^n_{out})) = -\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i \mid \Sigma^n_{out})) + \log_2\!\left(2^K\right). \tag{4.13}$$
Using the bound on the likelihood of the add-1/2 estimator from equation 4.3, we get

$$-\sum_{i=1}^{n} \log_2(\hat{\gamma}^{i-1}(y_i \mid \Sigma^n_{out})) \le -\sum_{i=1}^{n} \log_2(\gamma_{ML}(y_i)) + (K_n - 1)\left(\tfrac{1}{2}\log_2(n) + 1\right). \tag{4.14}$$
Combining the bounds from equations 4.13 and 4.14, we obtain a bound on the predictions of the mixture with online estimate parameters relative to the predictions of maximum likelihood model,
−
n X
log2 (γˆ i−1 (yi ))
i=1
≤−
n X
log2 (γML (yi )) + (Kn − 1)(1/2 log2 (n) + 1) + K.
(4.15)
i=1 2K
It is easy to verify that if n ≥ 2 K−Kn , which is the case when the observed alphabet is small (Kn ¿ K), then a mixture of add 12 predictors, one for each possible subset of the alphabet, achieves a better performance than a single add 12 predictor based on the full alphabet. Applying the online mixture estimation technique twice, first for the mixture of all possible submodels of a suffix tree transducer T and then for the parameters of each model in the mixture, yields an accurate and efficient online learning algorithm for both the structure and the parameters of the source. As before, let T0 ∈ Sub(T) be a suffix tree transducer whose parameters are set so as to maximize the likelihood of T0 on the entire input-output sequence. l(T0 ) is again the number of leaves of T0 , and ns the number of times a node s ∈ Leaves(T0 ) was reached on a sequence of n input-output pairs. Denote by ms the number of nonzero parameters of the output probability ∀n0 ≤ n, the value function of T0 at node s. Note that for a given node s andP 0 0 0 Kn for that node satisfies Kn ≤ ms . Denote by m(T ) = s∈Leaves(T0 ) ms the total number of nonzero parameters of the maximum likelihood transducer T0 . Combining the bound on the mixture of transducers from theorem 1 with the bound on the mixture of parameters from equation 4.15 while following the same steps used to derive equation 4.6, we get that
$$\mathrm{Loss}^{mix}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + \sum_{s \in \mathrm{Leaves}(T')} \left[ (m_s - 1)\left(\tfrac{1}{2}\log_2(n_s) + 1\right) + K \right] \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + l(T')(K - 1) + m(T') + \frac{1}{2} \sum_{s \in \mathrm{Leaves}(T')} (m_s - 1)\log_2(n_s). \tag{4.16}$$

Now,

$$\sum_{s} (m_s - 1)\log_2(n_s) = \sum_{s} (m_s - 1)\log_2\!\left(\frac{n_s}{m_s - 1}\right) + \sum_{s} (m_s - 1)\log_2(m_s - 1) \le \sum_{s} (m_s - 1)\log_2\!\left(\frac{n_s}{m_s - 1}\right) + m(T')\log_2(K).$$
Using the log-sum inequality (Cover & Thomas, 1991), we get

$$\sum_{s} (m_s - 1)\log_2\!\left(\frac{n_s}{m_s - 1}\right) \le \left(\sum_{s} m_s\right) \log_2\!\left(\frac{\sum_s n_s}{\sum_s (m_s - 1)}\right) = m(T')\log_2\!\left(\frac{n}{m(T') - l(T')}\right).$$
Therefore,

$$\mathrm{Loss}^{mix}_n \le \mathrm{Loss}_n(T') - \log_2(P_0(T')) + \frac{1}{2}\, m(T')\log_2\!\left(\frac{n}{m(T') - l(T')}\right) + l(T')(K - 1) + m(T')\log_2(2K). \tag{4.17}$$
As before, the above bound holds for any $T' \in \mathrm{Sub}(T)$. Therefore, the mixture model again tracks the best transducer in the pool, but the additional cost for specifying the parameters is smaller. Put another way, the predictions of a mixture of suffix tree transducers, each with parameters estimated using a mixture of all possible alphabets, lag behind the predictions of the maximum likelihood model by an additive factor that again scales like $C\log(n)/n$; but the constant $C$ now depends on the total number of nonzero parameters of the maximum likelihood model. Using efficient data structures to maintain the pointers to all the children of a node, both the time and the space complexity of the algorithm are $O(Dn|\Sigma_{out}|)$ ($O(n^2|\Sigma_{out}|)$ for unbounded-depth transducers). Furthermore, if the model is kept fixed after the training phase, we can evaluate the final estimate of the output functions once for each node, and the time complexity during a test (prediction only) phase reduces to $O(D)$ per observation.
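As a concrete reading of the $O(D)$ test-time claim, here is a small hedged sketch of the per-observation lookup; the attribute names `children` and `output_dist` are our assumptions, not the paper's API:

```python
def predict_next_output_dist(root, inputs, n):
    """Walk the suffix tree along x_n, x_{n-1}, ... (at most D steps for a
    depth-D tree, or until the start of the sequence) and return the deepest
    matching node's frozen output distribution."""
    node, depth = root, 0
    while depth < n:
        child = node.children.get(inputs[n - 1 - depth])  # next-older symbol
        if child is None:
            break  # deepest matching context found
        node, depth = child, depth + 1
    return node.output_dist
```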
Figure 3: (Left) Performance comparison of the predictions of a single model and two mixture models. The y-axis is the normalized negative log-likelihood, where the base of the log is $|\Sigma_{out}|$. (Right) The relative weights of three transducers in the mixture.
5 Evaluation and Applications

In this section we present evaluation results of the model and its learning algorithm on artificial data. We also discuss and present results obtained from learning the syntactic structure of noun phrases. In all the experiments described in this section, $\alpha$ was set to 1/2. The low time and space complexity of the learning algorithm enables us to evaluate the algorithm on millions of input-output pairs in a few minutes. For example, the average update time for a mixture of suffix tree transducers of maximal depth 10 when $|\Sigma_{out}| = 4$ is about 0.2 millisecond on an SGI Indy 200 MHz workstation. Results of the online predictions of the various estimation schemes are shown on the left-hand side of Figure 3. In this example, $\Sigma_{out} = \Sigma_{in} = \{1, 2, 3, 4\}$. The description of the source is as follows. If $x_n \ge 3$, then $y_n$ is uniformly distributed over $\Sigma_{out}$; otherwise ($x_n \le 2$), $y_n = x_{n-5}$ with probability 0.9 and $y_n = 5 - x_{n-5}$ with probability 0.1. The input sequence $x_1 x_2 \ldots$ was created entirely at random. This source can be implemented by a sparse suffix tree transducer of maximal depth 5. Note that the actual size of the alphabet is only 2 at half of the leaves of that tree. We used a mixture of suffix tree transducers of maximal depth 20 to learn the source. The normalized negative log-likelihood values are given for (1) a mixture model of suffix tree transducers with parameters estimated using a mixture of all subsets of the output alphabet, (2) a mixture model of the possible suffix tree transducers with parameters estimated using the add-1/2 scheme, and (3) a single model of depth 8 whose parameters are estimated using the add-1/2 scheme (note that the source creating the examples can be embedded in this model).
After a relatively small number of examples, the prediction of the mixture model converges to the prediction of the source, much faster than the single model. Furthermore, applying the mixture estimation to both the structure and the parameters significantly accelerates the convergence. In fewer than 10,000 examples, the cross-entropy between the mixture model and the source is less than 0.01 bit; even after 50,000 examples, the cross-entropy between the source and a single model is about 0.1 bit. On the right-hand side of Figure 3, we show how the mixture weights evolve as a function of the number of examples. At the beginning, the weights are distributed among all models, where the weight of a model is inversely proportional to its size. When there are only a few examples, the estimated parameters are inaccurate, and the predictions of all models are rather poor. Hence, the posterior weights remain close to the initial prior weights. As more data arrive, the predictions of the transducers of a large enough structure (i.e., the source itself and larger trees in which the source can be embedded) become more and more accurate. Thus, the sum of their weights grows larger at the expense of the small models. To illustrate the above, we plot in Figure 3 the relative weight as a function of the number of examples for three transducers in the mixture. The first, $T_S$, is based on a suffix tree that is strictly contained in the suffix tree transducer that generated the examples. Thus, this transducer is too small to approximate the source accurately. The second, $T_C$, is the transducer generating the examples, and the third, $T_L$, is a transducer based on a large suffix tree in which the true source can be embedded. The predictions of the small model are consistently wrong, and its relative weight decreases to almost zero after fewer than 10,000 input-output pairs. Eventually the parameters of $T_C$ and $T_L$ become accurate, and their predictions are essentially the same. Thus, their relative weights converge to the relative prior weights: $P_0(T_C)/(P_0(T_C) + P_0(T_L))$ and $P_0(T_L)/(P_0(T_C) + P_0(T_L))$. As mentioned in the introduction, there is a broad range of applications of probabilistic transducers. Here we briefly demonstrate and discuss how to induce an English noun phrase recognizer based on a suffix tree transducer. Recognizing noun phrases is an important task in automatic natural text processing for applications such as information retrieval, translation tools, and data extraction from texts. A common practice is to recognize noun phrases by first analyzing the text with a part-of-speech tagger, which assigns the appropriate part of speech (verb, noun, adjective, etc.) to each word in context. Then noun phrases are identified by manually defined regular expression patterns that are matched against the part-of-speech sequences. An alternative approach is to learn a mapping from the part-of-speech sequence to an output sequence that identifies the words that belong to a noun phrase (Church, 1988). We followed the second approach by building a suffix tree transducer based on a labeled data set from the Penn Tree Bank corpus. We defined $\Sigma_{in}$ to be the set of possible part-of-speech tags and set $\Sigma_{out} = \{0, 1\}$, where an output symbol given its corresponding input
Table 1: Extraction of Noun Phrases Using a Suffix Tree Transducer.

Sentence:            Tom   Smith  ,     group  chief  executive  of    U.K.  metals
Part-of-speech tag:  PNP   PNP    ,     NN     NN     NN         IN    PNP   NNS
Class:               1     1      0     1      1      1          0     1     1
Prediction:          0.99  0.99   0.01  0.98   0.98   0.98       0.02  0.99  0.99

Sentence:            and   industrial  materials  ,     will  become  chairman  .
Part-of-speech tag:  CC    JJ          NNS        ,     MD    VB      NN        .
Class:               1     1           1          0     0     0       1         0
Prediction:          0.67  0.96        0.99       0.03  0.03  0.01    0.87      0.01
Notes: In this typical example, two long noun phrases were identified correctly with high confidence. PNP = possessive nominal pronoun; NN = singular or mass noun; IN = preposition; NNS = plural noun; CC = coordination conjunction; JJ = adjective; MD = modal verb; VB = verb, baseform.
sequence (the part-of-speech tags of the words constituting the sentence) is 1 iff the word is part of a noun phrase. We used 250,000 marked tags and tested the performance on 37,000 tags. The performance of the suffix-tree-transducer-based tagger is tested by freezing the model structure, the mixture weights, and the estimated parameters. The suffix tree transducer was of maximal depth 15; hence very long phrases could potentially be identified. By comparing the predictions of the transducer to a threshold (1/2), we labeled each tag in the test data as either a member of a noun phrase or a filler. An example of the prediction of the transducer and the resulting classification is given in Table 1. We found that 2.4 percent of the words were labeled incorrectly. For comparison, when a single (zero-depth) suffix tree transducer is used to classify words, the error rate is about 15 percent. It is impractical to compare the performance of our transducer-based tagger to previous works due to the many differences in the train and test data sets used to construct the different systems. Nevertheless, the low error rate achieved by the mixture of suffix tree transducers and the efficiency of its learning algorithm suggest that this approach might be a powerful tool for language processing. One of our future research goals is to investigate methods for incorporating linguistic knowledge through the set of priors.

Appendix

Proof of Theorem 1. Let $s$ be any node in the full suffix tree $T$. To remind the reader, we denote by $1 \le i_1 < i_2 < \cdots < i_l = n$ the time steps at which the node $s$ was reached when observing a given sequence of $n$ input-output
pairs. Define

$$L_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in \mathcal{L}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T'), \qquad I_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in \mathcal{I}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T'), \qquad A_s(n) \stackrel{\mathrm{def}}{=} \sum_{T' \in \mathcal{A}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T').$$
Note that since $\mathcal{A}_s = \mathcal{L}_s \cup \mathcal{I}_s$, then $A_s(n) = L_s(n) + I_s(n)$. Also, if $s$ was reached at time step $n$ ($i_l = n$), then

$$L_s(n) = \sum_{T' \in \mathcal{L}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_l} \mid T') = \sum_{T' \in \mathcal{L}_s} P_0(T')\, P(y_{i_1} \ldots y_{i_{l-1}} \mid T')\, \gamma_s(y_n), \tag{A.1}$$

and therefore

$$\gamma_s(y_n) = L_s(n)/L_s(n-1). \tag{A.2}$$
For boundary conditions we define, $\forall s \in \mathrm{Leaves}(T)$: $r_n(s) \stackrel{\mathrm{def}}{=} \infty$; $\forall s$: $A_s(0) \stackrel{\mathrm{def}}{=} A_s(-1) \stackrel{\mathrm{def}}{=} 1$; and $\gamma_s(y_0) \stackrel{\mathrm{def}}{=} 1$. Let the depth of a node be the node's maximal distance from any of the leaves, and denote it by $d$. We now prove the following two properties of the algorithm by induction on $n$ and $d$.

Property A: For all nodes $s \in T$: $r_n(s) = \log(L_s(n)/I_s(n))$.

Property B: For any node $s \in T$ reached at time step $n$: $\tilde{\gamma}_s(y_n) = A_s(n)/A_s(n-1)$.

First note that if a node was not reached at time $n$, then $L_s(n) = L_s(n-1)$ and $I_s(n) = I_s(n-1)$, and property A holds based on the induction hypothesis for $n-1$. Second, if a node $s$ was reached at time step $n$, then the following also holds based on the induction hypothesis:

$$q_n(s) = \frac{1}{1 + e^{-r_n(s)}} = \frac{1}{1 + I_s(n)/L_s(n)} = \frac{L_s(n)}{L_s(n) + I_s(n)} = \frac{L_s(n)}{A_s(n)}. \tag{A.3}$$
We now show that the base of the induction is true. Property A holds by definition for $d = 0$ ($s \in \mathrm{Leaves}(T)$) and for all $n$. Since $\forall s \in \mathrm{Leaves}(T)$, $A_s(n) = L_s(n)$, then from equation A.2 we get that $\tilde{\gamma}_s(y_n) = \gamma_s(y_n) = L_s(n)/L_s(n-1) = A_s(n)/A_s(n-1)$, and property B holds as well. For $n = 0$ and for all $d$, property A holds from the definition of the prior
probabilities:

$$r_0(s) \stackrel{\mathrm{def}}{=} \log\!\left(\frac{\alpha}{1-\alpha}\right) = \log\!\left( \frac{\sum_{T' \in \mathcal{L}_s} P_0(T') \big/ \sum_{T' \in \mathcal{A}_s} P_0(T')}{1 - \sum_{T' \in \mathcal{L}_s} P_0(T') \big/ \sum_{T' \in \mathcal{A}_s} P_0(T')} \right) = \log\!\left(\frac{L_s(0)/A_s(0)}{1 - L_s(0)/A_s(0)}\right) = \log\!\left(\frac{L_s(0)}{A_s(0) - L_s(0)}\right) = \log\!\left(\frac{L_s(0)}{I_s(0)}\right). \tag{A.4}$$
Property B simply follows from the defined boundary conditions. Assume that the induction assumptions hold for $d' < d$, $n' \le n$ and for $d' \le d$, $n' < n$. Since property A holds for any node not reached at time $n$, we need to show the two properties only for nodes that are reached at time step $n$, namely, nodes of the form $s = x_j, \ldots, x_n$ where $j \le n$. Let $\sigma = x_{n-|s|}$; then, using the induction hypotheses for $\tilde{\gamma}$ at the node $\sigma s$ (of depth $d - 1 < d$) and for $r_{n-1}(s)$ (at time $n - 1 < n$), we get

$$r_n(s) = r_{n-1}(s) + \log\!\left(\frac{\gamma_s(y_n)}{\tilde{\gamma}_{\sigma s}(y_n)}\right) = \log\!\left( \frac{L_s(n-1)}{I_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} \cdot \frac{A_{\sigma s}(n-1)}{A_{\sigma s}(n)} \right). \tag{A.5}$$
The sets $\mathcal{A}_s$ and $\mathcal{I}_s$ consist of complete subtrees. Therefore, if $T' \in \mathcal{A}_{\sigma s}$, then $T' \in \mathcal{I}_s$, and $A_{\sigma s}(n) = I_s(n)$, which implies that

$$r_n(s) = \log\!\left( \frac{L_s(n-1)}{I_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} \cdot \frac{I_s(n-1)}{I_s(n)} \right) = \log(L_s(n)/I_s(n)). \tag{A.6}$$

Similarly, using the induction hypotheses for $q_{n-1}(s)$ and the node $\sigma s$,

$$\tilde{\gamma}_s(y_n) = q_{n-1}(s)\,\gamma_s(y_n) + (1 - q_{n-1}(s))\,\tilde{\gamma}_{\sigma s}(y_n) = \frac{L_s(n-1)}{A_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} + \frac{I_s(n-1)}{A_s(n-1)} \cdot \frac{A_{\sigma s}(n)}{A_{\sigma s}(n-1)} = \frac{L_s(n-1)}{A_s(n-1)} \cdot \frac{L_s(n)}{L_s(n-1)} + \frac{I_s(n-1)}{A_s(n-1)} \cdot \frac{I_s(n)}{I_s(n-1)} = \frac{L_s(n) + I_s(n)}{A_s(n-1)} = \frac{A_s(n)}{A_s(n-1)}. \tag{A.7}$$
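The recursions A.3, A.5, and A.7 translate directly into a per-node online update. The following is a hedged Python sketch of ours (class and function names are assumptions, not the paper's code) for a single node reached at time $n$:

```python
import math

class Node:
    """Per-node state: r_n(s) = log(L_s(n) / I_s(n))."""
    def __init__(self, alpha, is_leaf):
        # Boundary conditions: r = infinity at leaves; at internal nodes,
        # r_0(s) = log(alpha / (1 - alpha)), equation A.4.
        self.r = math.inf if is_leaf else math.log(alpha / (1.0 - alpha))
        self.is_leaf = is_leaf

def q(node):
    # Equation A.3: q_n(s) = 1 / (1 + exp(-r_n(s))).
    return 1.0 if math.isinf(node.r) else 1.0 / (1.0 + math.exp(-node.r))

def step(node, gamma_self, gamma_child_mix):
    """Return the node's mixed prediction on y_n (property B) and update r
    via equation A.5. gamma_self is the node's own estimate gamma_s(y_n);
    gamma_child_mix is the child sigma-s's mixed prediction (unused at
    leaves, where q = 1)."""
    qs = q(node)  # uses r_{n-1}(s), i.e., q_{n-1}(s), as in equation A.7
    if node.is_leaf:
        return gamma_self
    pred = qs * gamma_self + (1.0 - qs) * gamma_child_mix
    node.r += math.log(gamma_self / gamma_child_mix)  # equation A.5
    return pred
```

Evaluating `step` from the deepest reached node up to the root yields the root prediction $\tilde{\gamma}_\epsilon(y_n)$ of equation A.8.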
Hence, the induction hypotheses hold for all $n$ and $d$, and in particular,

$$\tilde{\gamma}_\epsilon(y_n) = \frac{A_\epsilon(n)}{A_\epsilon(n-1)} = \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_{i_1}, \ldots, y_{i_l} \mid T') \Big/ \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_{i_1}, \ldots, y_{i_{l-1}} \mid T'). \tag{A.8}$$

Since the root node is reached at all time steps, then $i_j = j$ and $y_{i_1}, \ldots, y_{i_l} = y_1, \ldots, y_n$. Thus $\tilde{\gamma}_\epsilon(y_n)$ is equal to

$$\tilde{\gamma}_\epsilon(y_n) = \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_1, \ldots, y_n \mid T') \Big/ \sum_{T' \in \mathcal{A}_\epsilon} P_0(T')\, P(y_1, \ldots, y_{n-1} \mid T'). \tag{A.9}$$
Therefore, the theorem holds, since by definition $\mathcal{A}_\epsilon = \mathrm{Sub}(T)$.

Proof of Theorem 2. The proof of the first part of the theorem is based on a technique introduced by DeSantis et al. (1988). Based on the definition of $A_\epsilon(n)$ and $P_{n-1}(T')$ from theorem 1, we can rewrite $\mathrm{Loss}^{mix}_n$ as

$$\mathrm{Loss}^{mix}_n = -\sum_{i=1}^{n} \log_2\!\left( \sum_{T' \in \mathrm{Sub}(T)} P(y_i \mid T')\, P_{i-1}(T') \right) = -\sum_{i=1}^{n} \log_2\!\left( \frac{A_\epsilon(i)}{A_\epsilon(i-1)} \right) = -\log_2\!\left( \prod_{i=1}^{n} \frac{A_\epsilon(i)}{A_\epsilon(i-1)} \right). \tag{A.10}$$

Canceling the numerator of one term with the denominator of the next, we get

$$\mathrm{Loss}^{mix}_n = -\log_2\!\left( \frac{A_\epsilon(n)}{A_\epsilon(0)} \right) = -\log_2\!\left( \frac{A_\epsilon(n)}{\sum_{T' \in \mathrm{Sub}(T)} P_0(T')} \right) = -\log_2\!\left( \sum_{T' \in \mathrm{Sub}(T)} P_0(T')\, P(y_1, \ldots, y_n \mid T') \right) \le -\log_2\!\left( \max_{T' \in \mathrm{Sub}(T)} P_0(T')\, P(y_1, \ldots, y_n \mid T') \right) = \min_{T' \in \mathrm{Sub}(T)} \left\{ \mathrm{Loss}_n(T') - \log_2(P_0(T')) \right\}. \tag{A.11}$$

This completes the proof of the first part of the theorem. At time step $i$, we traverse the suffix tree $T$ until a leaf is reached (at most $D$ nodes are visited) or the beginning of the input sequence is encountered (at most $i$ nodes are visited). Hence, the running time for a sequence of length $n$ is $Dn$ for a bounded-depth suffix tree, or $\sum_{i=1}^{n} i \le \frac{1}{2}(n+1)^2$ for an unbounded-depth suffix tree.

Acknowledgments

Thanks to Y. Bengio, L. Lee, F. Pereira, and M. Shaw for helpful discussions. I also thank the anonymous reviewers for their helpful comments.
References

Bengio, Y., & Frasconi, P. (1994). An input/output HMM architecture. In Advances in neural information processing systems 7 (pp. 427–434). Cambridge, MA: MIT Press.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.

Cesa-Bianchi, N., Freund, Y., Haussler, D., Helmbold, D. P., Schapire, R. E., & Warmuth, M. K. (1993). How to use expert advice. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing (pp. 382–391).

Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied Natural Language Processing (pp. 136–143).

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

DeSantis, A., Markowski, C., & Wegman, M. N. (1988). Learning probabilistic prediction functions. In Proceedings of the First Annual Workshop on Computational Learning Theory (pp. 312–328).

Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4, 393–405.

Haussler, D., & Barron, A. (1993). How well do Bayes methods work for on-line prediction of {+1, −1} values? In The 3rd NEC Symposium on Computation and Cognition.

Helmbold, D. P., & Schapire, R. E. (1995). Predicting nearly as well as the best pruning of a decision tree. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 61–68).

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3, 79–87.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.

Krichevsky, R. E., & Trofimov, V. K. (1981). The performance of universal coding. IEEE Transactions on Information Theory, 27, 199–207.

Littlestone, N., & Warmuth, M. K. (1994). The weighted majority algorithm. Information and Computation, 108, 212–261.

Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.

Riley, M. D. (1991). A statistical model for generating pronunciation networks. In Proc. of IEEE Conf. on Acoustics, Speech, and Signal Processing (pp. 737–740).

Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: Learning probabilistic automata with variable memory length. Machine Learning, 25, 117–149.

Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. J. (1995). The context tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3), 653–664.

Received September 12, 1996; accepted March 14, 1997.
Communicated by Ronald Williams
Long Short-Term Memory

Sepp Hochreiter
Fakultät für Informatik, Technische Universität München, 80290 München, Germany

Jürgen Schmidhuber
IDSIA, Corso Elvezia 36, 6900 Lugano, Switzerland
Learning to store information over extended time intervals by recurrent backpropagation takes a very long time, mostly because of insufficient, decaying error backflow. We briefly review Hochreiter's (1991) analysis of this problem, then address it by introducing a novel, efficient, gradient-based method called long short-term memory (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units. Multiplicative gate units learn to open and close access to the constant error flow. LSTM is local in space and time; its computational complexity per time step and weight is O(1). Our experiments with artificial data involve local, distributed, real-valued, and noisy pattern representations. In comparisons with real-time recurrent learning, back propagation through time, recurrent cascade correlation, Elman nets, and neural sequence chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, artificial long-time-lag tasks that have never been solved by previous recurrent network algorithms.

1 Introduction

In principle, recurrent networks can use their feedback connections to store representations of recent input events in the form of activations (short-term memory, as opposed to long-term memory embodied by slowly changing weights). This is potentially significant for many applications, including speech processing, non-Markovian control, and music composition (Mozer, 1992). The most widely used algorithms for learning what to put in short-term memory, however, take too much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backpropagation in feedforward nets with limited time windows. This article reviews an analysis of the problem and suggests a remedy.

Neural Computation 9, 1735–1780 (1997)
© 1997 Massachusetts Institute of Technology
The problem. With conventional backpropagation through time (BPTT; Williams & Zipser, 1992; Werbos, 1988) or real-time recurrent learning (RTRL; Robinson & Fallside, 1987), error signals flowing backward in time tend to (1) blow up or (2) vanish; the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter, 1991). Case 1 may lead to oscillating weights; in case 2, learning to bridge long time lags takes a prohibitive amount of time or does not work at all (see section 3).

This article presents long short-term memory (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-based learning algorithm. LSTM is designed to overcome these error backflow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short-time-lag capabilities. This is achieved by an efficient, gradient-based algorithm for an architecture enforcing constant (thus, neither exploding nor vanishing) error flow through internal states of special units (provided the gradient computation is truncated at certain architecture-specific points; this does not affect long-term error flow, though).

Section 2 briefly reviews previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It then introduces a naive approach to constant error backpropagation for didactic purposes and highlights its problems concerning information storage and retrieval. These problems lead to the LSTM architecture described in section 4. Section 5 presents numerous experiments and comparisons with competing methods. LSTM outperforms them and also learns to solve complex, artificial tasks no other recurrent net algorithm has solved. Section 6 discusses LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1) and explicit error flow formulas (A.2).

2 Previous Work

This section focuses on recurrent nets with time-varying inputs (as opposed to nets with stationary inputs and fixed-point-based gradient calculations; e.g., Almeida, 1987; Pineda, 1987).

2.1 Gradient-Descent Variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suffer from the same problems as BPTT and RTRL (see sections 1 and 3).

2.2 Time Delays. Other methods that seem practical for short time lags only are time-delay neural networks (Lang, Waibel, & Hinton, 1990) and Plate's method (Plate, 1993), which updates unit activations based on a
weighted sum of old activations (see also de Vries & Principe, 1991). Lin et al. (1996) propose variants of time-delay networks called NARX networks.

2.3 Time Constants. To deal with long time lags, Mozer (1992) uses time constants influencing changes of unit activations (de Vries and Principe's 1991 approach may in fact be viewed as a mixture of time-delay neural networks and time constants). For long time lags, however, the time constants need external fine tuning (Mozer, 1992). Sun, Chen, and Lee's alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.

2.4 Ring's Approach. Ring (1993) also proposed a method for bridging long time lags. Whenever a unit in his network receives conflicting error signals, he adds a higher-order unit influencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's net does not generalize to unseen lag durations.

2.5 Bengio et al.'s Approach. Bengio, Simard, and Frasconi (1994) investigate methods such as simulated annealing, multigrid random search, time-weighted pseudo-Newton optimization, and discrete error propagation. Their "latch" and "two-sequence" problems are very similar to problem 3a in this article with minimal time lag 100 (see Experiment 3). Bengio and Frasconi (1994) also propose an expectation-maximization approach for propagating targets. With n so-called state networks, at a given time, their system can be in one of only n different states. (See also the beginning of section 5.) But to solve continuous problems such as the adding problem (section 5.4), their system would require an unacceptable number of states (i.e., state networks).

2.6 Kalman Filters. Puskorius and Feldkamp (1994) use Kalman filter techniques to improve recurrent net performance. Since they use "a derivative discount factor imposed to decay exponentially the effects of past dynamic derivatives," there is no reason to believe that their Kalman filter-trained recurrent networks will be useful for very long minimal time lags.

2.7 Second Order Nets. We will see that LSTM uses multiplicative units (MUs) to protect error flow from unwanted perturbations. It is not the first recurrent net method using MUs, though. For instance, Watrous and Kuhn (1992) use MUs in second-order nets. There are some differences from LSTM: (1) Watrous and Kuhn's architecture does not enforce constant error flow and is not designed to solve long-time-lag problems; (2) it has fully connected second-order sigma-pi units, while the LSTM architecture's MUs
are used only to gate access to constant error flow; and (3) Watrous and Kuhn's algorithm costs $O(W^2)$ operations per time step, ours only $O(W)$, where $W$ is the number of weights. See also Miller and Giles (1993) for additional work on MUs.

2.8 Simple Weight Guessing. To avoid the long-time-lag problems of gradient-based approaches, we may simply randomly initialize all network weights until the resulting net happens to classify all training sequences correctly. In fact, recently we discovered (Schmidhuber & Hochreiter, 1996; Hochreiter & Schmidhuber, 1996, 1997) that simple weight guessing solves many of the problems in Bengio et al. (1994), Bengio and Frasconi (1994), Miller and Giles (1993), and Lin et al. (1996) faster than the algorithms these authors proposed. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.

2.9 Adaptive Sequence Chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subsequences causing the time lags (see also Mozer, 1992). For instance, in his postdoctoral thesis, Schmidhuber (1993) uses hierarchical recurrent nets to solve rapidly certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noise level increases and the input sequences become less compressible. LSTM does not suffer from this problem.

3 Constant Error Backpropagation

3.1 Exponentially Decaying Error

3.1.1 Conventional BPTT (e.g., Williams & Zipser, 1992). Output unit $k$'s target at time $t$ is denoted by $d_k(t)$. Using mean squared error, $k$'s error signal is

$$\vartheta_k(t) = f_k'(\mathrm{net}_k(t))\,(d_k(t) - y^k(t)),$$

where $y^i(t) = f_i(\mathrm{net}_i(t))$ is the activation of a noninput unit $i$ with differentiable activation function $f_i$,

$$\mathrm{net}_i(t) = \sum_j w_{ij}\, y^j(t-1)$$
is unit i’s current net input, and wij is the weight on the connection from unit j to i. Some nonoutput unit j’s backpropagated error signal is X wij ϑi (t + 1). ϑj (t) = fj0 (netj (t)) i
The corresponding contribution to $w_{jl}$'s total weight update is $\alpha\,\vartheta_j(t)\,y^l(t-1)$, where $\alpha$ is the learning rate and $l$ stands for an arbitrary unit connected to unit $j$.

3.1.2 Outline of Hochreiter's Analysis (1991, pp. 19–21). Suppose we have a fully connected net whose noninput unit indices range from 1 to $n$. Let us focus on local error flow from unit $u$ to unit $v$ (later we will see that the analysis immediately extends to global error flow). The error occurring at an arbitrary unit $u$ at time step $t$ is propagated back into time for $q$ time steps, to an arbitrary unit $v$. This will scale the error by the following factor:

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \begin{cases} f_v'(\mathrm{net}_v(t-1))\, w_{uv} & q = 1 \\[4pt] f_v'(\mathrm{net}_v(t-q)) \sum_{l=1}^{n} \dfrac{\partial \vartheta_l(t-q+1)}{\partial \vartheta_u(t)}\, w_{lv} & q > 1 \end{cases}. \tag{3.1}$$
With $l_q = v$ and $l_0 = u$, we obtain

$$\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)} = \sum_{l_1=1}^{n} \cdots \sum_{l_{q-1}=1}^{n} \prod_{m=1}^{q} f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}} \tag{3.2}$$

(proof by induction). The sum of the $n^{q-1}$ terms $\prod_{m=1}^{q} f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}$ determines the total error backflow (note that since the summation terms may have different signs, increasing the number of units $n$ does not necessarily increase error flow).

3.1.3 Intuitive Explanation of Equation 3.2. If

$$|f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}| > 1.0$$

for all $m$ (as can happen, e.g., with linear $f_{l_m}$), then the largest product increases exponentially with $q$. That is, the error blows up, and conflicting error signals arriving at unit $v$ can lead to oscillating weights and unstable learning (for error blowups or bifurcations, see also Pineda, 1988; Baldi & Pineda, 1991; Doya, 1992). On the other hand, if

$$|f_{l_m}'(\mathrm{net}_{l_m}(t-m))\, w_{l_m l_{m-1}}| < 1.0$$

for all $m$, then the largest product decreases exponentially with $q$. That is, the error vanishes, and nothing can be learned in acceptable time.
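The compounding factor in equation 3.2 is easy to see numerically. The following is a small illustration of ours (not code from the paper), assuming logistic units and identical weights and net inputs along one path:

```python
import math

def logistic_prime(net):
    y = 1.0 / (1.0 + math.exp(-net))
    return y * (1.0 - y)  # maximal value 0.25 at net = 0

def error_scaling(w, net, q):
    """Magnitude of a single length-q path's contribution in equation 3.2,
    with the same weight w and net input at every step."""
    return abs(logistic_prime(net) * w) ** q

for w in (0.5, 3.9, 4.5):
    print(w, [round(error_scaling(w, 0.0, q), 6) for q in (1, 10, 50)])
# With |w| < 4.0 the per-step factor |f' * w| stays below 1 (since f' <= 0.25),
# so the backflow decays exponentially in q; for |w| > 4.0 it can blow up.
```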
If $f_{l_m}$ is the logistic sigmoid function, then the maximal value of $f_{l_m}'$ is 0.25. If $y^{l_{m-1}}$ is constant and not equal to zero, then $|f_{l_m}'(\mathrm{net}_{l_m})\, w_{l_m l_{m-1}}|$ takes on maximal values where

$$w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}} \coth\!\left(\frac{1}{2}\, \mathrm{net}_{l_m}\right),$$

goes to zero for $|w_{l_m l_{m-1}}| \to \infty$, and is less than 1.0 for $|w_{l_m l_{m-1}}| < 4.0$ (e.g., if the absolute maximal weight value $w_{max}$ is smaller than 4.0). Hence with conventional logistic sigmoid activation functions, the error flow tends to vanish as long as the weights have absolute values below 4.0, especially in the beginning of the training phase. In general, the use of larger initial weights will not help, though; as seen above, for $|w_{l_m l_{m-1}}| \to \infty$ the relevant derivative goes to zero "faster" than the absolute weight can grow (also, some weights will have to change their signs by crossing zero). Likewise, increasing the learning rate does not help either; it will not change the ratio of long-range error flow to short-range error flow. BPTT is too sensitive to recent distractions. (A very similar, more recent analysis was presented by Bengio et al., 1994.)

3.1.4 Global Error Flow. The local error flow analysis above immediately shows that global error flow vanishes too. To see this, compute

$$\sum_{u:\ u\ \text{output unit}} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_u(t)}.$$

3.1.5 Weak Upper Bound for Scaling Factor. The following, slightly extended vanishing error analysis also takes $n$, the number of units, into account. For $q > 1$, equation 3.2 can be rewritten as

$$(W_u^T)^T F'(t-1) \prod_{m=2}^{q-1} \left( W F'(t-m) \right) W_v\, f_v'(\mathrm{net}_v(t-q)),$$

where the weight matrix $W$ is defined by $[W]_{ij} := w_{ij}$, $v$'s outgoing weight vector $W_v$ is defined by $[W_v]_i := [W]_{iv} = w_{iv}$, $u$'s incoming weight vector $W_u^T$ is defined by $[W_u^T]_i := [W]_{ui} = w_{ui}$, and, for $m = 1, \ldots, q$, $F'(t-m)$ is the diagonal matrix of first-order derivatives defined as $[F'(t-m)]_{ij} := 0$ if $i \neq j$, and $[F'(t-m)]_{ij} := f_i'(\mathrm{net}_i(t-m))$ otherwise. Here $T$ is the transposition operator, $[A]_{ij}$ is the element in the $i$th column and $j$th row of matrix $A$, and $[x]_i$ is the $i$th component of vector $x$. Using a matrix norm $\|\cdot\|_A$ compatible with vector norm $\|\cdot\|_x$, we define $f'_{max} := \max_{m=1,\ldots,q} \{\|F'(t-m)\|_A\}$. For $\max_{i=1,\ldots,n}\{|x_i|\} \le \|x\|_x$ we get $|x^T y| \le n\, \|x\|_x \|y\|_x$. Since

$$|f_v'(\mathrm{net}_v(t-q))| \le \|F'(t-q)\|_A \le f'_{max},$$
For maxi=1,...,n {|xi |} ≤ kxkx we get |xT y| ≤ n kxkx kykx . Since 0 | fv0 (netv (t − q))| ≤ kF0 (t − q)kA ≤ fmax ,
Long Short-Term Memory
1741
we obtain the following inequality: ¯ ¯ ¯ ∂ϑv (t − q) ¯ ¡ 0 ¢q q−2 0 q ¯ ¯ ≤ n fmax kWkA . ¯ ∂ϑ (t) ¯ ≤ n ( fmax ) kWv kx kWuT kx kWkA u This inequality results from kWv kx = kWev kx ≤ kWkA kev kx ≤ kWkA and kWuT kx = kW T eu kx ≤ kWkA keu kx ≤ kWkA , where ek is the unit vector whose components are 0 except for the kth component, which is 1. Note that this is a weak, extreme case upper bound; it will be reached only if all kF0 (t − m)kA take on maximal values, and if the contributions of all paths across which error flows back from unit u to unit v have the same sign. Large kWkA , however, typically result in small values of kF0 (t − m)kA , as confirmed by experiments (see, e.g., Hochreiter, 1991). For example, with norms X |wrs | kWkA := max r
s
and kxkx := max |xr |, r
0 we have fmax = 0.25 for the logistic sigmoid. We observe that if
|wij | ≤ wmax <
4.0 ∀i, j, n
then ¡kWkA¢ ≤ nwmax < 4.0 will result in exponential decay. By setting max < 1.0, we obtain τ := nw4.0 ¯ ¯ ¯ ∂ϑv (t − q) ¯ q ¯ ¯ ¯ ∂ϑ (t) ¯ ≤ n (τ ) . u We refer to Hochreiter (1991) for additional results. 3.2 Constant Error Flow: Naive Approach. 3.2.1 A Single Unit. To avoid vanishing error signals, how can we achieve constant error flow through a single unit j with a single connection to itself? According to the rules above, at time t, j’s local error backflow is ϑj (t) = fj0 (netj (t))ϑj (t + 1)wjj . To enforce constant error flow through j, we require fj0 (netj (t))wjj = 1.0.
1742
Sepp Hochreiter and Jurgen ¨ Schmidhuber
Note the similarity to Mozer’s fixed time constant system (1992)—a time constant of 1.0 is appropriate for potentially infinite time lags.1 3.2.2 The Constant Error Carousel. Integrating the differential equation above, we obtain fj (netj (t)) =
netj (t) wjj
for arbitrary netj (t). This means fj has to be linear, and unit j’s activation has to remain constant: yj (t + 1) = fj (netj (t + 1)) = fj (wjj y j (t)) = y j (t). In the experiments, this will be ensured by using the identity function fj : fj (x) = x, ∀x, and by setting wjj = 1.0. We refer to this as the constant error carousel (CEC). CEC will be LSTM’s central feature (see section 4). Of course, unit j will not only be connected to itself but also to other units. This invokes two obvious, related problems (also inherent in all other gradient-based approaches): 1. Input weight conflict: For simplicity, let us focus on a single additional input weight wji . Assume that the total error can be reduced by switching on unit j in response to a certain input and keeping it active for a long time (until it helps to compute a desired output). Provided i is nonzero, since the same incoming weight has to be used for both storing certain inputs and ignoring others, wji will often receive conflicting weight update signals during this time (recall that j is linear). These signals will attempt to make wji participate in (1) storing the input (by switching on j) and (2) protecting the input (by preventing j from being switched off by irrelevant later inputs). This conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling write operations through input weights. 2. Output weight conflict: Assume j is switched on and currently stores some previous input. For simplicity, let us focus on a single additional outgoing weight wkj . The same wkj has to be used for both retrieving j’s content at certain times and preventing j from disturbing k at other times. As long as unit j is nonzero, wkj will attract conflicting weight update signals generated during sequence processing. These signals will attempt to make wkj participate in accessing the information stored in j and—at different times—protecting unit k from being perturbed by j. For instance, with many tasks there are certain shorttime-lag errors that can be reduced in early training stages. However, 1 We do not use the expression “time constant” in the differential sense, as Pearlmutter (1995) does.
at later training stages, $j$ may suddenly start to cause avoidable errors in situations that already seemed under control by attempting to participate in reducing more difficult long-time-lag errors. Again, this conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling read operations through output weights.

Of course, input and output weight conflicts are not specific to long time lags; they occur for short time lags as well. Their effects, however, become particularly pronounced in the long-time-lag case. As the time lag increases, stored information must be protected against perturbation for longer and longer periods, and, especially in advanced stages of learning, more and more already correct outputs also require protection against perturbation.

Due to the problems set out, the naive approach does not work well except in the case of certain simple problems involving local input-output representations and nonrepeating input patterns (see Hochreiter, 1991; Silva, Amarel, Langlois, & Almeida, 1996). The next section shows how to do it right.

4 The Concept of Long Short-Term Memory

4.1 Memory Cells and Gate Units. To construct an architecture that allows for constant error flow through special, self-connected units without the disadvantages of the naive approach, we extend the CEC embodied by the self-connected, linear unit $j$ from section 3.2 by introducing additional features. A multiplicative input gate unit is introduced to protect the memory contents stored in $j$ from perturbation by irrelevant inputs, and a multiplicative output gate unit is introduced to protect other units from perturbation by currently irrelevant memory contents stored in $j$. The resulting, more complex unit is called a memory cell (see Figure 1). The $j$th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (the CEC). In addition to $\mathrm{net}_{c_j}$, $c_j$ gets input from a multiplicative unit $out_j$ (the output gate) and from another multiplicative unit $in_j$ (the input gate). $in_j$'s activation at time $t$ is denoted by $y^{in_j}(t)$, $out_j$'s by $y^{out_j}(t)$. We have

$$y^{out_j}(t) = f_{out_j}(\mathrm{net}_{out_j}(t)); \qquad y^{in_j}(t) = f_{in_j}(\mathrm{net}_{in_j}(t));$$

where

$$\mathrm{net}_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \quad \text{and} \quad \mathrm{net}_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1).$$
Figure 1: Architecture of memory cell cj (the box) and its gate units inj , outj . The self-recurrent connection (with weight 1.0) indicates feedback with a delay of one time step. It builds the basis of the CEC. The gate units open and close access to CEC. See text and appendix A.1 for details.
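For concreteness, here is a minimal sketch (ours, not the authors' implementation) of one memory cell's forward step, combining the gate equations above with the internal-state update given just below in the text; the squashing ranges follow those used later in experiment 1:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def memory_cell_step(s_prev, y_prev, w_in, w_out, w_c, g, h):
    """One time step of a single memory cell: gate activations, CEC state
    update with fixed self-weight 1.0, and gated output. y_prev is the list
    of source-unit activations y^u(t-1); w_in, w_out, w_c are the matching
    weight rows of the input gate, output gate, and cell input."""
    net_in = sum(w * y for w, y in zip(w_in, y_prev))
    net_out = sum(w * y for w, y in zip(w_out, y_prev))
    net_c = sum(w * y for w, y in zip(w_c, y_prev))
    y_in, y_out = sigmoid(net_in), sigmoid(net_out)
    s = s_prev + y_in * g(net_c)   # CEC: s_c(t) = s_c(t-1) + y^in(t) g(net_c(t))
    return s, y_out * h(s)         # cell output: y_c(t) = y^out(t) h(s_c(t))

# Squashing functions with the ranges used in experiment 1:
g = lambda x: 4.0 * sigmoid(x) - 2.0   # range [-2, 2]
h = lambda x: 2.0 * sigmoid(x) - 1.0   # range [-1, 1]
```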
We also have

$$\mathrm{net}_{c_j}(t) = \sum_u w_{c_j u}\, y^u(t-1).$$

The summation indices $u$ may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see section 4.3). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There even may be recurrent self-connections like $w_{c_j c_j}$. It is up to the user to define the network topology. See Figure 2 for an example.

At time $t$, $c_j$'s output $y^{c_j}(t)$ is computed as

$$y^{c_j}(t) = y^{out_j}(t)\, h(s_{c_j}(t)),$$

where the internal state $s_{c_j}(t)$ is

$$s_{c_j}(0) = 0, \qquad s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\, g\!\left(\mathrm{net}_{c_j}(t)\right) \quad \text{for } t > 0.$$

The differentiable function $g$ squashes $\mathrm{net}_{c_j}$; the differentiable function $h$ scales memory cell outputs computed from the internal state $s_{c_j}$.

Figure 2: Example of a net with eight input units, four output units, and two memory cell blocks of size 2. in1 marks the input gate, out1 marks the output gate, and cell1/block1 marks the first memory cell of block 1. cell1/block1's architecture is identical to the one in Figure 1, with gate units in1 and out1 (note that by rotating Figure 1 by 90 degrees anticlockwise, it will match the corresponding parts of Figure 2). The example assumes dense connectivity: each gate unit and each memory cell sees all nonoutput units. For simplicity, however, outgoing weights of only one type of unit are shown for each layer. With the efficient, truncated update rule, error flows only through connections to output units and through fixed self-connections within cell blocks (not shown here; see Figure 1). Error flow is truncated once it "wants" to leave memory cells or gate units. Therefore, no connection shown above serves to propagate error back to the unit from which the connection originates (except for connections to output units), although the connections themselves are modifiable. That is why the truncated LSTM algorithm is so efficient, despite its ability to bridge very long time lags. See the text and the appendix for details. Figure 2 shows the architecture used for experiment 6a; only the bias of the noninput units is omitted.

4.2 Why Gate Units? To avoid input weight conflicts, $in_j$ controls the error flow to memory cell $c_j$'s input connections $w_{c_j i}$. To circumvent $c_j$'s output weight conflicts, $out_j$ controls the error flow from unit $j$'s output connections. In other words, the net can use $in_j$ to decide when to keep or override information in memory cell $c_j$, and $out_j$ to decide when to access memory cell $c_j$ and when to prevent other units from being perturbed by $c_j$ (see Figure 1). Error signals trapped within a memory cell's CEC cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which
errors to trap in its CEC by appropriately scaling them. The input gate will have to learn when to release errors, again by appropriately scaling them. Essentially, the multiplicative gate units open and close access to constant error flow through CEC.

Distributed output representations typically do require output gates. Both gate types are not always necessary, though; one may be sufficient. For instance, in experiments 2a and 2b in section 5, it will be possible to use input gates only. In fact, output gates are not required in case of local output encoding; preventing memory cells from perturbing already learned outputs can be done by simply setting the corresponding weights to zero. Even in this case, however, output gates can be beneficial: they prevent the net's attempts at storing long-time-lag memories (which are usually hard to learn) from perturbing activations representing easily learnable short-time-lag memories. (This will prove quite useful in experiment 1, for instance.)

4.3 Network Topology. We use networks with one input layer, one hidden layer, and one output layer. The (fully) self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as being located in the hidden layer). The hidden layer may also contain conventional hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in the layer above (or to all higher layers; see experiments 2a and 2b).

4.4 Memory Cell Blocks. $S$ memory cells sharing the same input gate and the same output gate form a structure called a memory cell block of size $S$. Memory cell blocks facilitate information storage. As with conventional neural nets, it is not so easy to code a distributed input within a single cell. Since each memory cell block has as many gate units as a single memory cell (namely, two), the block architecture can be even slightly more efficient. A memory cell block of size 1 is just a simple memory cell. In the experiments in section 5, we will use memory cell blocks of various sizes.

4.5 Learning. We use a variant of RTRL (e.g., Robinson & Fallside, 1987) that takes into account the altered, multiplicative dynamics caused by input and output gates. To ensure nondecaying error backpropagation through internal states of memory cells, as with truncated BPTT (e.g., Williams & Peng, 1990), errors arriving at memory cell net inputs (for cell $c_j$, this includes $\mathrm{net}_{c_j}$, $\mathrm{net}_{in_j}$, $\mathrm{net}_{out_j}$) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$.² To visualize
Long Short-Term Memory
1747
this, once an error signal arrives at a memory cell output, it gets scaled by output gate activation and h0 . Then it is within the memory cell’s CEC, where it can flow back indefinitely without ever being scaled. When it leaves the memory cell through the input gate and g, it is scaled once more by input gate activation and g0 . It then serves to change the incoming weights before it is truncated (see the appendix for formulas). 4.6 Computational Complexity. As with Mozer’s focused recurrent backpropagation algorithm (Mozer, 1989), only the derivatives ∂scj /∂wil need to be stored and updated. Hence the LSTM algorithm is very efficient, with an excellent update complexity of O(W), where W the number of weights (see details in the appendix). Hence, LSTM and BPTT for fully recurrent nets have the same update complexity per time step (while RTRL’s is much worse). Unlike full BPTT, however, LSTM is local in space and time:3 there is no need to store activation values observed during sequence processing in a stack with potentially unlimited size. 4.7 Abuse Problem and Solutions. In the beginning of the learning phase, error reduction may be possible without storing information over time. The network will thus tend to abuse memory cells, for example, as bias cells (it might make their activations constant and use the outgoing connections as adaptive thresholds for other units). The potential difficulty is that it may take a long time to release abused memory cells and make them available for further learning. A similar “abuse problem” appears if two memory cells store the same (redundant) information. There are at least two solutions to the abuse problem: (1) sequential network construction (e.g., Fahlman, 1991): a memory cell and the corresponding gate units are added to the network whenever the error stops decreasing (see experiment 2 in section 5), and (2) output gate bias: each output gate gets a negative initial bias, to push initial memory cell activations toward zero. Memory cells with more negative bias automatically get “allocated” later (see experiments 1, 3, 4, 5, and 6 in section 5). 4.8 Internal State Drift and Remedies. If memory cell cj ’s inputs are mostly positive or mostly negative, then its internal state sj will tend to drift away over time. This is potentially dangerous, for the h0 (sj ) will then adopt very small values, and the gradient will vanish. One way to circumvent this problem is to choose an appropriate function h. But h(x) = x, for instance, has the disadvantage of unrestricted memory cell output range. Our simple 3 Following Schmidhuber (1989), we say that a recurrent net algorithm is local in space if the update complexity per time step and weight does not depend on network size. We say that a method is local in time if its storage requirements do not depend on input sequence length. For instance, RTRL is local in time but not in space. BPTT is local in space but not in time.
1748
Sepp Hochreiter and Jurgen ¨ Schmidhuber
but effective way of solving drift problems at the beginning of learning is initially to bias the input gate inj toward zero. Although there is a trade-off between the magnitudes of h0 (sj ) on the one hand and of yinj and fin0 j on the other, the potential negative effect of input gate bias is negligible compared to the one of the drifting effect. With logistic sigmoid activation functions, there appears to be no need for fine-tuning the initial bias, as confirmed by experiments 4 and 5 in section 5.4. 5 Experiments Which tasks are appropriate to demonstrate the quality of a novel longtime-lag algorithm? First, minimal time lags between relevant input signals and corresponding teacher signals must be long for all training sequences. In fact, many previous recurrent net algorithms sometimes manage to generalize from very short training sequences to very long test sequences (see, e.g., Pollack, 1991). But a real long-time-lag problem does not have any short-time-lag exemplars in the training set. For instance, Elman’s training procedure, BPTT, offline RTRL, online RTRL, and others fail miserably on real long-time-lag problems. (See, e.g., Hochreiter, 1991; Mozer, 1992.) A second important requirement is that the tasks should be complex enough such that they cannot be solved quickly by simple-minded strategies such as random weight guessing. Recently we discovered (Schmidhuber & Hochreiter, 1996; Hochreiter & Schmidhuber, 1996, 1997) that many long-time-lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the proposed algorithms. For instance, guessing solved a variant of Bengio and Frasconi’s parity problem (1994) much faster4 than the seven methods tested by Bengio et al. (1994) and Bengio and Frasconi (1994). The same is true for some of Miller and Giles’s problems (1993). Of course, this does not mean that guessing is a good algorithm. It just means that some previously used problems are not extremely appropriate to demonstrate the quality of previously proposed algorithms. All our experiments (except experiment 1) involve long minimal time lags; there are no short-time-lag training exemplars facilitating learning. Solutions to most of our tasks are sparse in weight space. They require either many parameters and inputs or high weight precision, such that random weight guessing becomes infeasible. We always use online learning (as opposed to batch learning) and logistic sigmoids as activation functions. For experiments 1 and 2, initial weights are chosen in the range [−0.2, 0.2], for the other experiments in [−0.1, 0.1]. Training sequences are generated randomly according to the various task 4 Different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).
Long Short-Term Memory
1749
descriptions. In slight deviation from the notation in appendix A.1, each discrete time step of each input sequence involves three processing steps: (1) use current input to set the input units, (2) compute activations of hidden units (including input gates, output gates, memory cells), and (3) compute output unit activations. Except for experiments 1, 2a, and 2b, sequence elements are randomly generated online, and error signals are generated only at sequence ends. Net activations are reset after each processed input sequence. For comparisons with recurrent nets taught by gradient descent, we give results only for RTRL, except for comparison 2a, which also includes BPTT. Note, however, that untruncated BPTT (see, e.g., Williams & Peng, 1990) computes exactly the same gradient as offline RTRL. With long-time-lag problems, offline RTRL (or BPTT) and the online version of RTRL (no activation resets, online weight changes) lead to almost identical, negative results (as confirmed by additional simulations in Hochreiter, 1991; see also Mozer, 1992). This is because offline RTRL, online RTRL, and full BPTT all suffer badly from exponential error decay. Our LSTM architectures are selected quite arbitrarily. If nothing is known about the complexity of a given problem, a more systematic approach would be to: start with a very small net consisting of one memory cell. If this does not work, try two cells, and so on. Alternatively, use sequential network construction (e.g., Fahlman, 1991). Following is an outline of the experiments: • Experiment 1 focuses on a standard benchmark test for recurrent nets: the embedded Reber grammar. Since it allows for training sequences with short time lags, it is not a long-time-lag problem. We include it because it provides a nice example where LSTM’s output gates are truly beneficial, and it is a popular benchmark for recurrent nets that has been used by many authors. We want to include at least one experiment where conventional BPTT and RTRL do not fail completely (LSTM, however, clearly outperforms them). The embedded Reber grammar’s minimal time lags represent a border case in the sense that it is still possible to learn to bridge them with conventional algorithms. Only slightly longer minimal time lags would make this almost impossible. The more interesting tasks in our article, however, are those that RTRL, BPTT, and others cannot solve at all. • Experiment 2 focuses on noise-free and noisy sequences involving numerous input symbols distracting from the few important ones. The most difficult task (task 2c) involves hundreds of distractor symbols at random positions and minimal time lags of 1000 steps. LSTM solves it; BPTT and RTRL already fail in case of 10-step minimal time lags (see also Hochreiter, 1991; Mozer, 1992). For this reason RTRL and BPTT are omitted in the remaining, more complex experiments, all of which involve much longer time lags.
• Experiment 3 addresses long-time-lag problems with noise and signal on the same input line. Experiments 3a and 3b focus on Bengio et al.'s 1994 two-sequence problem. Because this problem can be solved quickly by random weight guessing, we also include a far more difficult two-sequence problem (experiment 3c), which requires learning real-valued, conditional expectations of noisy targets, given the inputs.

• Experiments 4 and 5 involve distributed, continuous-valued input representations and require learning to store precise, real values for very long time periods. Relevant input signals can occur at quite different positions in input sequences. Again minimal time lags involve hundreds of steps. Similar tasks have never been solved by other recurrent net algorithms.

• Experiment 6 involves tasks of a different complex type that also has not been solved by other recurrent net algorithms. Again, relevant input signals can occur at quite different positions in input sequences. The experiment shows that LSTM can extract information conveyed by the temporal order of widely separated inputs.

Section 5.7 provides a detailed summary of experimental conditions in two tables for reference.

5.1 Experiment 1: Embedded Reber Grammar.

5.1.1 Task. Our first task is to learn the embedded Reber grammar (Smith & Zipser, 1989; Cleeremans, Servan-Schreiber, & McClelland, 1989; Fahlman, 1991). Since it allows for training sequences with short time lags (of as few as nine steps), it is not a long-time-lag problem. We include it for two reasons: (1) it is a popular recurrent net benchmark used by many authors, and we wanted to have at least one experiment where RTRL and BPTT do not fail completely, and (2) it shows nicely how output gates can be beneficial.

Starting at the left-most node of the directed graph in Figure 3, symbol strings are generated sequentially (beginning with the empty string) by following edges—and appending the associated symbols to the current string—until the right-most node is reached (strings of the embedded Reber grammar are analogously generated from Figure 4, where each box is a copy of the Reber grammar). Edges are chosen randomly if there is a choice (probability: 0.5). The net's task is to read strings, one symbol at a time, and to predict the next symbol (error signals occur at every time step). To predict the symbol before last, the net has to remember the second symbol.

5.1.2 Comparison. We compare LSTM to Elman nets trained by Elman's training procedure (ELM) (results taken from Cleeremans et al., 1989), Fahlman's recurrent cascade-correlation (RCC) (results taken from Fahlman, 1991), and RTRL (results taken from Smith & Zipser, 1989, where only the few successful trials are listed). Smith and Zipser actually make the task easier by increasing the probability of short-time-lag exemplars. We did not do this for LSTM.
Figure 3: Transition diagram for the Reber grammar.
Figure 4: Transition diagram for the embedded Reber grammar. Each box represents a copy of the Reber grammar (see Figure 3).
5.1.3 Training/Testing. We use a local input-output representation (seven input units, seven output units). Following Fahlman, we use 256 training strings and 256 separate test strings. The training set is generated randomly; training exemplars are picked randomly from the training set. Test sequences are generated randomly, too, but sequences already used in the training set are not used for testing. After string presentation, all activations are reinitialized with zeros. A trial is considered successful if all string symbols of all sequences in both test set and training set are predicted correctly—that is, if the output unit(s) corresponding to the possible next symbol(s) is (are) always the most active one(s).

5.1.4 Architectures. Architectures for RTRL, ELM, and RCC are reported in the references listed above. For LSTM, we use three (four) memory cell blocks. Each block has two (one) memory cells. The output layer's only incoming connections originate at memory cells. Each memory cell and each gate unit receives incoming connections from all memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. The gate units are biased. These architecture parameters make it easy to store at least three input signals (architectures 3-2 and 4-1 are employed to obtain comparable numbers of weights for both architectures: 264 for 4-1 and 276 for 3-2). Other parameters may be appropriate as well, however. All sigmoid functions are logistic with output range [0, 1], except for h, whose range is [−1, 1], and g, whose range is [−2, 2]. All weights are initialized in [−0.2, 0.2], except for the output gate biases, which are initialized to −1, −2, and −3, respectively (see abuse problem, solution 2 of section 4). We tried learning rates of 0.1, 0.2, and 0.5.

5.1.5 Results. We use three different, randomly generated pairs of training and test sets. With each such pair we run 10 trials with different initial weights. See Table 1 for results (mean of 30 trials). Unlike the other methods, LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.

5.1.6 Importance of Output Gates. The experiment provides a nice example where the output gate is truly beneficial. Learning to store the first T or P should not perturb activations representing the more easily learnable transitions of the original Reber grammar. This is the job of the output gates. Without output gates, we did not achieve fast learning.
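For readers who wish to reproduce the data, a minimal Python sketch of the string generator is given below. The transition table encodes our reading of the standard Reber grammar diagrams (Figures 3 and 4); all names are our own, and edges are chosen with probability 0.5 wherever there is a choice, as stated above.

```python
import random

# Our reading of the standard Reber grammar (Figure 3):
# node -> list of (emitted symbol, next node); node 6 is terminal.
REBER_EDGES = {
    1: [("T", 2), ("P", 3)],
    2: [("S", 2), ("X", 4)],
    3: [("T", 3), ("V", 5)],
    4: [("X", 3), ("S", 6)],
    5: [("P", 4), ("V", 6)],
}

def reber_string():
    """Random walk from the left-most to the right-most node."""
    node, symbols = 1, ["B"]
    while node != 6:
        symbol, node = random.choice(REBER_EDGES[node])  # probability 0.5 per edge
        symbols.append(symbol)
    symbols.append("E")
    return "".join(symbols)

def embedded_reber_string():
    """Embedded grammar (Figure 4): B, then T or P, then a full Reber
    string, then the same T or P again, then E."""
    first = random.choice("TP")
    return "B" + first + reber_string() + first + "E"
```

Note that the second symbol (T or P) must be remembered across the entire embedded substring; this is exactly the dependency that the output gates help to protect during learning.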
Table 1: Experiment 1: Embedded Reber Grammar.
Method | Hidden Units | Number of Weights | Learning Rate | % of Success | After
RTRL | 3 | ≈ 170 | 0.05 | some fraction | 173,000
RTRL | 12 | ≈ 494 | 0.1 | some fraction | 25,000
ELM | 15 | ≈ 435 | — | 0 | > 200,000
RCC | 7–9 | ≈ 119–198 | — | 50 | 182,000
LSTM | 4 blocks, size 1 | 264 | 0.1 | 100 | 39,740
LSTM | 3 blocks, size 2 | 276 | 0.1 | 100 | 21,730
LSTM | 3 blocks, size 2 | 276 | 0.2 | 97 | 14,060
LSTM | 4 blocks, size 1 | 264 | 0.5 | 97 | 9,500
LSTM | 3 blocks, size 2 | 276 | 0.5 | 100 | 8,440
Notes: Percentage of successful trials and number of sequence presentations until success for RTRL (results taken from Smith & Zipser, 1989), Elman net trained by Elman's procedure (results taken from Cleeremans et al., 1989), recurrent cascade-correlation (results taken from Fahlman, 1991), and our new approach (LSTM). Weight numbers in the first four rows are estimates; the corresponding papers do not provide all the technical details. Only LSTM almost always learns to solve the task (only 2 failures out of 150 trials). Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster (the number of required training examples in the bottom row varies between 3800 and 24,100).
5.2 Experiment 2: Noise-Free and Noisy Sequences.

5.2.1 Task 2a: Noise-Free Sequences with Long Time Lags. There are p + 1 possible input symbols denoted $a_1, \dots, a_{p-1}, a_p = x, a_{p+1} = y$. $a_i$ is locally represented by the $(p+1)$-dimensional vector whose $i$th component is 1 (all other components are 0). A net with p + 1 input units and p + 1 output units sequentially observes input symbol sequences, one at a time, continually trying to predict the next symbol; error signals occur at every time step. To emphasize the long-time-lag problem, we use a training set consisting of only two very similar sequences: $(y, a_1, a_2, \dots, a_{p-1}, y)$ and $(x, a_1, a_2, \dots, a_{p-1}, x)$. Each is selected with probability 0.5. To predict the final element, the net has to learn to store a representation of the first element for p time steps.

We compare real-time recurrent learning for fully recurrent nets (RTRL), backpropagation through time (BPTT), the sometimes very successful two-net neural sequence chunker (CH; Schmidhuber, 1992b), and our new method (LSTM). In all cases, weights are initialized in [−0.2, 0.2]. Due to limited computation time, training is stopped after 5 million sequence presentations. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is always below 0.25.
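For illustration, the following Python sketch generates one training sequence for task 2a. The index convention is our own assumption: symbol $a_i$ maps to one-hot index i − 1, so x gets index p − 1 and y gets index p.

```python
import random

def one_hot(i, dim):
    """Local representation: dim-dimensional vector with a single 1."""
    v = [0.0] * dim
    v[i] = 1.0
    return v

def task_2a_sequence(p):
    """Returns (y, a_1, ..., a_{p-1}, y) or (x, a_1, ..., a_{p-1}, x),
    each with probability 0.5, as (p + 1)-dimensional one-hot vectors."""
    lead = random.choice([p - 1, p])                     # x or y
    middle = [one_hot(i, p + 1) for i in range(p - 1)]   # a_1, ..., a_{p-1}
    return [one_hot(lead, p + 1)] + middle + [one_hot(lead, p + 1)]
```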
Table 2: Task 2a: Percentage of Successful Trials and Number of Training Sequences until Success.
Method | Delay p | Learning Rate | Number of Weights | % Successful Trials | Success After
RTRL | 4 | 1.0 | 36 | 78 | 1,043,000
RTRL | 4 | 4.0 | 36 | 56 | 892,000
RTRL | 4 | 10.0 | 36 | 22 | 254,000
RTRL | 10 | 1.0–10.0 | 144 | 0 | > 5,000,000
RTRL | 100 | 1.0–10.0 | 10,404 | 0 | > 5,000,000
BPTT | 100 | 1.0–10.0 | 10,404 | 0 | > 5,000,000
CH | 100 | 1.0 | 10,506 | 33 | 32,400
LSTM | 100 | 1.0 | 10,504 | 100 | 5,040
Notes: Table entries refer to means of 18 trials. With 100 time-step delays, only CH and LSTM achieve successful trials. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.
Architectures. RTRL: one self-recurrent hidden unit, p + 1 nonrecurrent output units. Each layer has connections from all layers below. All units use the logistic sigmoid activation function with range [0, 1]. BPTT: same architecture as the one trained by RTRL. CH: both net architectures are like RTRL's, but one net has an additional output for predicting the hidden unit of the other one (see Schmidhuber, 1992b, for details). LSTM: as with RTRL, but the hidden unit is replaced by a memory cell and an input gate (no output gate required). $g$ is the logistic sigmoid, and $h$ is the identity function ($h(x) = x$ for all $x$). Memory cell and input gate are added once the error has stopped decreasing (see abuse problem: solution 1 in section 4).

Results. Using RTRL and a short four-time-step delay (p = 4), 7/9 of all trials were successful. No trial was successful with p = 10. With long time lags, only the neural sequence chunker and LSTM achieved successful trials; BPTT and RTRL failed. With p = 100, the two-net sequence chunker solved the task in only one-third of all trials. LSTM, however, always learned to solve the task. Comparing successful trials only, LSTM learned much faster. See Table 2 for details. It should be mentioned, however, that a hierarchical chunker can also always quickly solve this task (Schmidhuber, 1992c, 1993).

5.2.2 Task 2b: No Local Regularities. With task 2a, the chunker sometimes learns to predict the final element correctly, but only because of predictable local regularities in the input stream that allow for compressing the sequence.
In a more difficult task, involving many more different possible sequences, we remove compressibility by replacing the deterministic subsequence $(a_1, a_2, \dots, a_{p-1})$ by a random subsequence (of length p − 1) over the alphabet $a_1, a_2, \dots, a_{p-1}$. We obtain two classes (two sets of sequences) $\{(y, a_{i_1}, a_{i_2}, \dots, a_{i_{p-1}}, y) \mid 1 \le i_1, i_2, \dots, i_{p-1} \le p-1\}$ and $\{(x, a_{i_1}, a_{i_2}, \dots, a_{i_{p-1}}, x) \mid 1 \le i_1, i_2, \dots, i_{p-1} \le p-1\}$. Again, every next sequence element has to be predicted. The only totally predictable targets, however, are x and y, which occur at sequence ends. Training exemplars are chosen randomly from the two classes. Architectures and parameters are the same as in experiment 2a. A successful run is one that fulfills the following criterion: after training, during 10,000 successive, randomly chosen input sequences, the maximal absolute error of all output units is below 0.25 at sequence end.

Results. As expected, the chunker failed to solve this task (so did BPTT and RTRL, of course). LSTM, however, was always successful. On average (mean of 18 trials), success for p = 100 was achieved after 5680 sequence presentations. This demonstrates that LSTM does not require sequence regularities to work well.

5.2.3 Task 2c: Very Long Time Lags—No Local Regularities. This is the most difficult task in this subsection. To our knowledge, no other recurrent net algorithm can solve it. Now there are p + 4 possible input symbols denoted $a_1, \dots, a_{p-1}, a_p, a_{p+1} = e, a_{p+2} = b, a_{p+3} = x, a_{p+4} = y$. $a_1, \dots, a_p$ are also called distractor symbols. Again, $a_i$ is locally represented by the $(p+4)$-dimensional vector whose $i$th component is 1 (all other components are 0). A net with p + 4 input units and 2 output units sequentially observes input symbol sequences, one at a time. Training sequences are randomly chosen from the union of two very similar subsets of sequences: $\{(b, y, a_{i_1}, a_{i_2}, \dots, a_{i_{q+k}}, e, y) \mid 1 \le i_1, i_2, \dots, i_{q+k} \le p\}$ and $\{(b, x, a_{i_1}, a_{i_2}, \dots, a_{i_{q+k}}, e, x) \mid 1 \le i_1, i_2, \dots, i_{q+k} \le p\}$. To produce a training sequence, we randomly generate a sequence prefix of length q + 2, then repeatedly generate a further random element (≠ b, e, x, y) with probability 9/10 or, alternatively, an e with probability 1/10. In the latter case, we conclude the sequence with x or y, depending on the second element. For a given k, this leads to a uniform distribution on the possible sequences with length q + k + 4. The minimal sequence length is q + 4; since the number k of additional elements has expectation $\sum_{k=0}^{\infty} \left(\frac{9}{10}\right)^k \frac{1}{10}\, k = 9$, the expected length is

$$4 + \sum_{k=0}^{\infty} \left(\frac{9}{10}\right)^k \frac{1}{10}\, (q + k) = q + 13.$$

The expected number of occurrences of element $a_i$, $1 \le i \le p$, in a sequence is $(q + 9)/p \approx q/p$. The goal is to predict the last symbol, which always occurs after the "trigger symbol" e. Error signals are generated only at sequence ends.
Table 3: Task 2c: LSTM with Very Long Minimal Time Lags q + 1 and a Lot of Noise.

q (Time Lag − 1) | p (Number of Random Inputs) | q/p | Number of Weights | Success After
50 | 50 | 1 | 364 | 30,000
100 | 100 | 1 | 664 | 31,000
200 | 200 | 1 | 1264 | 33,000
500 | 500 | 1 | 3064 | 38,000
1000 | 1000 | 1 | 6064 | 49,000
1000 | 500 | 2 | 3064 | 49,000
1000 | 200 | 5 | 1264 | 75,000
1000 | 100 | 10 | 664 | 135,000
1000 | 50 | 20 | 364 | 203,000
Notes: p is the number of available distractor symbols (p + 4 is the number of input units). q/p is the expected number of occurrences of a given distractor symbol in a sequence. The right-most column lists the number of training sequences required by LSTM (BPTT, RTRL, and the other competitors have no chance of solving this task). If we let the number of distractor symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. The lower block illustrates the expected slowdown due to increased frequency of distractor symbols.
To predict the final element, the net has to learn to store a representation of the second element for at least q + 1 time steps (until it sees the trigger symbol e). Success is defined as prediction error (for the final sequence element) of both output units always below 0.2, for 10,000 successive, randomly chosen input sequences.

Architecture/Learning. The net has p + 4 input units and 2 output units. Weights are initialized in [−0.2, 0.2]. To avoid too much learning time variance due to different weight initializations, the hidden layer gets two memory cells (two cell blocks of size 1, although one would be sufficient). There are no other hidden units. The output layer receives connections only from memory cells. Memory cells and gate units receive connections from input units, memory cells, and gate units (the hidden layer is fully connected). No bias weights are used. h and g are logistic sigmoids scaled to output ranges [−1, 1] and [−2, 2], respectively. The learning rate is 0.01. Note that the minimal time lag is q + 1; the net never sees short training sequences facilitating the classification of long test sequences.

Results. Twenty trials were made for all tested pairs (p, q). Table 3 lists the mean number of training sequences required by LSTM to achieve success (BPTT and RTRL have no chance of solving nontrivial tasks with minimal time lags of 1000 steps).
Scaling. Table 3 shows that if we let the number of input symbols (and weights) increase in proportion to the time lag, learning time increases very slowly. This is another remarkable property of LSTM not shared by any other method we are aware of. Indeed, RTRL and BPTT are far from scaling reasonably; they appear to scale exponentially and become quite useless once the time lags exceed as few as 10 steps.

Distractor Influence. In Table 3, the column headed by q/p gives the expected frequency of distractor symbols. Increasing this frequency decreases learning speed, an effect due to weight oscillations caused by frequently observed input symbols.

5.3 Experiment 3: Noise and Signal on Same Channel. This experiment serves to illustrate that LSTM does not encounter fundamental problems if noise and signal are mixed on the same input line. We initially focus on Bengio et al.'s simple 1994 two-sequence problem. In experiment 3c we pose a more challenging two-sequence problem.

5.3.1 Task 3a (Two-Sequence Problem). The task is to observe and then classify input sequences. There are two classes, each occurring with probability 0.5. There is only one input line. Only the first N real-valued sequence elements convey relevant information about the class. Sequence elements at positions t > N are generated by a gaussian with mean zero and variance 0.2. Case N = 1: the first sequence element is 1.0 for class 1 and −1.0 for class 2. Case N = 3: the first three elements are 1.0 for class 1 and −1.0 for class 2. The target at the sequence end is 1.0 for class 1 and 0.0 for class 2. Correct classification is defined as absolute output error at sequence end below 0.2. Given a constant T, the sequence length is randomly selected between T and T + T/10 (a difference from Bengio et al.'s problem is that they also permit shorter sequences of length T/2).

Guessing. Bengio et al. (1994) and Bengio and Frasconi (1994) tested seven different methods on the two-sequence problem. We discovered, however, that random weight guessing easily outperforms them all because the problem is so simple.[5] See Schmidhuber and Hochreiter (1996) and Hochreiter and Schmidhuber (1996, 1997) for additional results in this vein.

[5] However, different input representations and different types of noise may lead to worse guessing performance (Yoshua Bengio, personal communication, 1996).

LSTM Architecture. We use a three-layer net with one input unit, one output unit, and three cell blocks of size 1. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from input units, memory cells, and gate units and have bias weights.
Table 4: Task 3a: Bengio et al.’s Two-Sequence Problem.
T | N | Stop: ST1 | Stop: ST2 | Number of Weights | ST2: Fraction Misclassified
100 | 3 | 27,380 | 39,850 | 102 | 0.000195
100 | 1 | 58,370 | 64,330 | 102 | 0.000117
1000 | 3 | 446,850 | 452,460 | 102 | 0.000078
Notes: T is the minimal sequence length. N is the number of information-conveying elements at the beginning of the sequence. The column headed by ST1 (ST2) gives the number of sequence presentations required to achieve stopping criterion ST1 (ST2). The right-most column lists the fraction of misclassified posttraining sequences (with absolute error > 0.2) from a test set consisting of 2560 sequences (tested after ST2 was achieved). All values are means of 10 trials. We discovered, however, that this problem is so simple that random weight guessing solves it faster than LSTM and any other method for which there are published results.
Gate units and the output unit are logistic sigmoids in [0, 1], h is in [−1, 1], and g is in [−2, 2].

Training/Testing. All weights (except the bias weights to gate units) are randomly initialized in the range [−0.1, 0.1]. The first input gate bias is initialized with −1.0, the second with −3.0, and the third with −5.0. The first output gate bias is initialized with −2.0, the second with −4.0, and the third with −6.0. The precise initialization values hardly matter, though, as confirmed by additional experiments. The learning rate is 1.0. All activations are reset to zero at the beginning of a new sequence.

We stop training (and judge the task as being solved) according to the following criteria: ST1: none of 256 sequences from a randomly chosen test set is misclassified; ST2: ST1 is satisfied, and the mean absolute test set error is below 0.01. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 4. The results are means of 10 trials with different weight initializations in the range [−0.1, 0.1]. LSTM is able to solve this problem, though by far not as fast as random weight guessing (see "Guessing" above). Clearly this trivial problem does not provide a very good testbed for comparing the performance of various nontrivial algorithms. Still, it demonstrates that LSTM does not encounter fundamental problems when faced with signal and noise on the same channel.

5.3.2 Task 3b. The architecture, parameters, and other elements are as in task 3a, but now with gaussian noise (mean 0 and variance 0.2) added to the information-conveying elements ($t \le N$).
Table 5: Task 3b: Modified Two-Sequence Problem.
T | N | Stop: ST1 | Stop: ST2 | Number of Weights | ST2: Fraction Misclassified
100 | 3 | 41,740 | 43,250 | 102 | 0.00828
100 | 1 | 74,950 | 78,430 | 102 | 0.01500
1000 | 1 | 481,060 | 485,080 | 102 | 0.01207
Note: Same as in Table 4, but now the information-conveying elements are also perturbed by noise.
We stop training (and judge the task as being solved) according to the following, slightly redefined criteria: ST1: fewer than 6 out of 256 sequences from a randomly chosen test set are misclassified; ST2: ST1 is satisfied, and the mean absolute test set error is below 0.04. In case of ST2, an additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 5. The results represent means of 10 trials with different weight initializations. LSTM easily solves the problem.

5.3.3 Task 3c. The architecture, parameters, and other elements are as in task 3a, but with a few essential changes that make the task nontrivial: the targets are 0.2 and 0.8 for class 1 and class 2, respectively, and there is gaussian noise on the targets (mean 0 and variance 0.1; S.D. 0.32). To minimize mean squared error, the system has to learn the conditional expectations of the targets given the inputs. Misclassification is defined as an absolute difference between output and noise-free target (0.2 for class 1 and 0.8 for class 2) exceeding 0.1. The network output is considered acceptable if the mean absolute difference between noise-free target and output is below 0.015. Since this requires high weight precision, task 3c (unlike tasks 3a and 3b) cannot be solved quickly by random guessing.

Training/Testing. The learning rate is 0.1. We stop training according to the following criterion: none of 256 sequences from a randomly chosen test set is misclassified, and the mean absolute difference between the noise-free target and the output is below 0.015. An additional test set consisting of 2560 randomly chosen sequences is used to determine the fraction of misclassified sequences.

Results. See Table 6. The results represent means of 10 trials with different weight initializations. Despite the noisy targets, LSTM still can solve the problem by learning the expected target values.
Table 6: Task 3c: Modified, More Challenging Two-Sequence Problem.
T | N | Stop | Number of Weights | Fraction Misclassified | Average Difference to Mean
100 | 3 | 269,650 | 102 | 0.00558 | 0.014
100 | 1 | 565,640 | 102 | 0.00441 | 0.012
Notes: Same as in Table 4, but with noisy real-valued targets. The system has to learn the conditional expectations of the targets given the inputs. The right-most column provides the average difference between network output and expected target. Unlike tasks 3a and 3b, this one cannot be solved quickly by random weight guessing.
5.4 Experiment 4: Adding Problem. The difficult task in this section is of a type that has never been solved by other recurrent net algorithms. It shows that LSTM can solve long-time-lag problems involving distributed, continuous-valued representations.

5.4.1 Task. Each element of each input sequence is a pair of components. The first component is a real value randomly chosen from the interval [−1, 1]; the second is 1.0, 0.0, or −1.0 and is used as a marker. At the end of each sequence, the task is to output the sum of the first components of those pairs that are marked by second components equal to 1.0. Sequences have random lengths between the minimal sequence length T and T + T/10. In a given sequence, exactly two pairs are marked, as follows: we first randomly select and mark one of the first 10 pairs (whose first component we call X1). Then we randomly select and mark one of the first T/2 − 1 still unmarked pairs (whose first component we call X2). The second components of all remaining pairs are zero except for the first and final pair, whose second components are −1. (In the rare case where the first pair of the sequence gets marked, we set X1 to zero.) An error signal is generated only at the sequence end: the target is 0.5 + (X1 + X2)/4.0 (the sum X1 + X2 scaled to the interval [0, 1]). A sequence is processed correctly if the absolute error at the sequence end is below 0.04. (A generator sketch follows after the architecture description below.)

5.4.2 Architecture. We use a three-layer net with two input units, one output unit, and two cell blocks of size 2. The output layer receives connections only from memory cells. Memory cells and gate units receive inputs from memory cells and gate units (the hidden layer is fully connected; less connectivity may work as well). The input layer has forward connections to all units in the hidden layer. All noninput units have bias weights. These architecture parameters make it easy to store at least two input signals (a cell block size of 1 works well, too). All activation functions are logistic with output range [0, 1], except for h, whose range is [−1, 1], and g, whose range is [−2, 2].
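A minimal Python sketch of our reading of the adding-problem data generator (all names are our own; the special cases follow the task description above):

```python
import random

def adding_problem(T):
    """One (inputs, target) pair for the adding problem.
    inputs: list of (value, marker) pairs; target lies in [0, 1]."""
    length = T + random.randint(0, T // 10)        # random length in [T, T + T/10]
    values = [random.uniform(-1.0, 1.0) for _ in range(length)]
    markers = [0.0] * length
    markers[0] = markers[-1] = -1.0                # first and final pairs marked -1
    i1 = random.randrange(10)                      # one of the first 10 pairs
    i2 = random.choice([i for i in range(T // 2 - 1) if i != i1])
    markers[i1] = markers[i2] = 1.0
    x1 = 0.0 if i1 == 0 else values[i1]            # rare case: first pair marked
    x2 = values[i2]
    target = 0.5 + (x1 + x2) / 4.0                 # sum scaled to [0, 1]
    return list(zip(values, markers)), target
```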
Table 7: Experiment 4: Results for the Adding Problem.
T | Minimal Lag | Number of Weights | Number of Wrong Predictions | Success After
100 | 50 | 93 | 1 out of 2560 | 74,000
500 | 250 | 93 | 0 out of 2560 | 209,000
1000 | 500 | 93 | 1 out of 2560 | 853,000
Notes: T is the minimal sequence length, T/2 the minimal time lag. “Number of Wrong Predictions” is the number of incorrectly processed sequences (error > 0.04) from a test set containing 2560 sequences. The right-most column gives the number of training sequences required to achieve the stopping criterion. All values are means of 10 trials. For T = 1000 the number of required training examples varies between 370,000 and 2,020,000, exceeding 700,000 in only three cases.
5.4.3 State Drift Versus Initial Bias. Note that the task requires storing the precise values of real numbers for long durations; the system must learn to protect memory cell contents against even minor internal state drift (see section 4). To study the significance of the drift problem, we make the task even more difficult by biasing all noninput units, thus artificially inducing internal state drift. All weights (including the bias weights) are randomly initialized in the range [−0.1, 0.1]. Following section 4's remedy for state drifts, the first input gate bias is initialized with −3.0 and the second with −6.0 (though the precise values hardly matter, as confirmed by additional experiments).

5.4.4 Training/Testing. The learning rate is 0.5. Training is stopped once the average training error is below 0.01 and the 2000 most recent sequences were processed correctly.

5.4.5 Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.01, and there were never more than three incorrectly processed sequences. Table 7 shows details. The experiment demonstrates that LSTM is able to work well with distributed representations, that LSTM is able to learn to perform calculations involving continuous values, and, since the system manages to store continuous values without deterioration for minimal delays of T/2 time steps, that there is no significant, harmful internal state drift.

5.5 Experiment 5: Multiplication Problem. One may argue that LSTM is a bit biased toward tasks such as the adding problem from the previous subsection. Solutions to the adding problem may exploit the CEC's built-in integration capabilities. Although this CEC property may be viewed as a feature rather than a disadvantage (integration seems to be a natural subtask of many tasks occurring in the real world), the question arises whether LSTM can also solve tasks with inherently nonintegrative solutions. To test this, we change the problem by requiring the final target to equal the product (instead of the sum) of earlier marked inputs.
Table 8: Experiment 5: Results for the Multiplication Problem.
T | Minimal Lag | Number of Weights | nseq | Number of Wrong Predictions | MSE | Success After
100 | 50 | 93 | 140 | 139 out of 2560 | 0.0223 | 482,000
100 | 50 | 93 | 13 | 14 out of 2560 | 0.0139 | 1,273,000
Notes: T is the minimal sequence length and T/2 the minimal time lag. We test on a test set containing 2560 sequences as soon as fewer than nseq of the 2000 most recent training sequences lead to error > 0.04. "Number of Wrong Predictions" is the number of test sequences with error > 0.04. MSE is the mean squared error on the test set. The right-most column lists numbers of training sequences required to achieve the stopping criterion. All values are means of 10 trials.
5.5.1 Task. This is like the task in section 5.4, except that the first component of each pair is a real value randomly chosen from the interval [0, 1]. In the rare case where the first pair of the input sequence gets marked, we set X1 to 1.0. The target at the sequence end is the product X1 × X2.

5.5.2 Architecture. This is as in section 5.4. All weights (including the bias weights) are randomly initialized in the range [−0.1, 0.1].

5.5.3 Training/Testing. The learning rate is 0.1. We test performance twice: as soon as fewer than nseq of the 2000 most recent training sequences lead to absolute errors exceeding 0.04, for nseq = 140 and nseq = 13. Why these values? nseq = 140 is sufficient to learn storage of the relevant inputs, but it is not enough to fine-tune the precise final outputs; nseq = 13, however, leads to quite satisfactory results.

5.5.4 Results. For nseq = 140 (nseq = 13), with a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.026 (0.013), and there were never more than 170 (15) incorrectly processed sequences. Table 8 shows details. (A net with additional standard hidden units or with a hidden layer above the memory cells may learn the fine-tuning part more quickly.)

The experiment demonstrates that LSTM can solve tasks involving both continuous-valued representations and nonintegrative information processing.
5.6 Experiment 6: Temporal Order. In this subsection, LSTM solves other difficult (but artificial) tasks that have never been solved by previous recurrent net algorithms. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs.

5.6.1 Task 6a: Two Relevant, Widely Separated Symbols. The goal is to classify sequences. Elements and targets are represented locally (input vectors with only one nonzero bit). The sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for two elements at positions t1 and t2 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, and t2 is randomly chosen between 50 and 60. There are four sequence classes, Q, R, S, U, which depend on the temporal order of X and Y. The rules are: X, X → Q; X, Y → R; Y, X → S; Y, Y → U. (A generator sketch follows after the architecture description below.)

5.6.2 Task 6b: Three Relevant, Widely Separated Symbols. Again, the goal is to classify sequences. Elements and targets are represented locally. The sequence starts with an E, ends with a B (the trigger symbol), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for three elements at positions t1, t2, and t3 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, t2 is randomly chosen between 33 and 43, and t3 is randomly chosen between 66 and 76. There are eight sequence classes—Q, R, S, U, V, A, B, C—which depend on the temporal order of the Xs and Ys. The rules are: X, X, X → Q; X, X, Y → R; X, Y, X → S; X, Y, Y → U; Y, X, X → V; Y, X, Y → A; Y, Y, X → B; Y, Y, Y → C.

There are as many output units as there are classes. Each class is locally represented by a binary target vector with one nonzero component. With both tasks, error signals occur only at the end of a sequence. A sequence is classified correctly if the final absolute error of all output units is below 0.3.

Architecture. We use a three-layer net with eight input units, two (three) cell blocks of size 2, and four (eight) output units for task 6a (6b). Again, all noninput units have bias weights, and the output layer receives connections from memory cells only. Memory cells and gate units receive inputs from input units, memory cells, and gate units (the hidden layer is fully connected; less connectivity may work as well). The architecture parameters for task 6a (6b) make it easy to store at least two (three) input signals. All activation functions are logistic with output range [0, 1], except for h, whose range is [−1, 1], and g, whose range is [−2, 2].
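A Python sketch of the task 6a sequence generator (our own reading of the specification; in the experiments the symbols would be one-hot encoded over the eight-symbol alphabet, as stated above):

```python
import random

# Class label as a function of the temporal order of the two relevant symbols.
CLASS_OF = {("X", "X"): "Q", ("X", "Y"): "R", ("Y", "X"): "S", ("Y", "Y"): "U"}

def task_6a_sequence():
    """Returns (symbol sequence, class label) for task 6a."""
    length = random.randint(100, 110)
    t1 = random.randint(10, 20)                    # first relevant position
    t2 = random.randint(50, 60)                    # second relevant position
    seq = [random.choice("abcd") for _ in range(length)]
    seq[0], seq[-1] = "E", "B"                     # start symbol and trigger symbol
    relevant = (random.choice("XY"), random.choice("XY"))
    seq[t1], seq[t2] = relevant
    return seq, CLASS_OF[relevant]
```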
Table 9: Experiment 6: Results for the Temporal Order Problem.
Task | Number of Weights | Number of Wrong Predictions | Success After
Task 6a | 156 | 1 out of 2560 | 31,390
Task 6b | 308 | 2 out of 2560 | 571,100
Notes: “Number of Wrong Predictions” is the number of incorrectly classified sequences (error > 0.3 for at least one output unit) from a test set containing 2560 sequences. The right-most column gives the number of training sequences required to achieve the stopping criterion. The results for task 6a are means of 20 trials; those for task 6b of 10 trials.
Training/Testing. The learning rate is 0.5 (0.1) for experiment 6a (6b). Training is stopped once the average training error falls below 0.1 and the 2000 most recent sequences were classified correctly. All weights are initialized in the range [−0.1, 0.1]. The first input gate bias is initialized with −2.0, the second with −4.0, and (for experiment 6b) the third with −6.0 (again, we confirmed by additional experiments that the precise values hardly matter).

Results. With a test set consisting of 2560 randomly chosen sequences, the average test set error was always below 0.1, and there were never more than three incorrectly classified sequences. Table 9 shows details. The experiment shows that LSTM is able to extract information conveyed by the temporal order of widely separated inputs. In task 6a, for instance, the delays between the first and second relevant input and between the second relevant input and the sequence end are at least 30 time steps.

Typical Solutions. In experiment 6a, how does LSTM distinguish between the temporal orders (X, Y) and (Y, X)? One of many possible solutions is to store the first X or Y in cell block 1 and the second X/Y in cell block 2. Before the first X/Y occurs, block 1 can see that it is still empty by means of its recurrent connections. After the first X/Y, block 1 can close its input gate. Once block 1 is filled and closed, this fact becomes visible to block 2 (recall that all gate units and all memory cells receive connections from all nonoutput units).

Typical solutions, however, require only one memory cell block. The block stores the first X or Y; once the second X/Y occurs, it changes its state depending on the first stored symbol. Solution type 1 exploits the connection between memory cell output and input gate unit. The following events cause different input gate activations: X occurs in conjunction with a filled block; X occurs in conjunction with an empty block. Solution type 2 is based on a strong, positive connection between memory cell output and memory cell input. The previous occurrence of X (Y) is represented by a positive (negative) internal state. Once the input gate opens for the second time, so does the output gate, and the memory cell output is fed back to its own input. This causes (X, Y) to be represented by a positive internal state, because X contributes to the new internal state twice (via the current internal state and the cell output feedback). Similarly, (Y, X) gets represented by a negative internal state.
Table 10: Summary of Experimental Conditions for LSTM, Part I.

(1) Task | (2) p | (3) lag | (4) b | (5) s | (6) in | (7) out | (8) w | (9) c | (10) ogb | (11) igb | (12) bias | (13) h | (14) g | (15) α
1-1 | 9 | 9 | 4 | 1 | 7 | 7 | 264 | F | −1, −2, −3, −4 | r | ga | h1 | g2 | 0.1
1-2 | 9 | 9 | 3 | 2 | 7 | 7 | 276 | F | −1, −2, −3 | r | ga | h1 | g2 | 0.1
1-3 | 9 | 9 | 3 | 2 | 7 | 7 | 276 | F | −1, −2, −3 | r | ga | h1 | g2 | 0.2
1-4 | 9 | 9 | 4 | 1 | 7 | 7 | 264 | F | −1, −2, −3, −4 | r | ga | h1 | g2 | 0.5
1-5 | 9 | 9 | 3 | 2 | 7 | 7 | 276 | F | −1, −2, −3 | r | ga | h1 | g2 | 0.5
2a | 100 | 100 | 1 | 1 | 101 | 101 | 10,504 | B | no og | none | none | id | g1 | 1.0
2b | 100 | 100 | 1 | 1 | 101 | 101 | 10,504 | B | no og | none | none | id | g1 | 1.0
2c-1 | 50 | 50 | 2 | 1 | 54 | 2 | 364 | F | none | none | none | h1 | g2 | 0.01
2c-2 | 100 | 100 | 2 | 1 | 104 | 2 | 664 | F | none | none | none | h1 | g2 | 0.01
2c-3 | 200 | 200 | 2 | 1 | 204 | 2 | 1264 | F | none | none | none | h1 | g2 | 0.01
2c-4 | 500 | 500 | 2 | 1 | 504 | 2 | 3064 | F | none | none | none | h1 | g2 | 0.01
2c-5 | 1000 | 1000 | 2 | 1 | 1004 | 2 | 6064 | F | none | none | none | h1 | g2 | 0.01
2c-6 | 1000 | 1000 | 2 | 1 | 504 | 2 | 3064 | F | none | none | none | h1 | g2 | 0.01
2c-7 | 1000 | 1000 | 2 | 1 | 204 | 2 | 1264 | F | none | none | none | h1 | g2 | 0.01
2c-8 | 1000 | 1000 | 2 | 1 | 104 | 2 | 664 | F | none | none | none | h1 | g2 | 0.01
2c-9 | 1000 | 1000 | 2 | 1 | 54 | 2 | 364 | F | none | none | none | h1 | g2 | 0.01
3a | 100 | 100 | 3 | 1 | 1 | 1 | 102 | F | −2, −4, −6 | −1, −3, −5 | b1 | h1 | g2 | 1.0
3b | 100 | 100 | 3 | 1 | 1 | 1 | 102 | F | −2, −4, −6 | −1, −3, −5 | b1 | h1 | g2 | 1.0
3c | 100 | 100 | 3 | 1 | 1 | 1 | 102 | F | −2, −4, −6 | −1, −3, −5 | b1 | h1 | g2 | 0.1
4-1 | 100 | 50 | 2 | 2 | 2 | 1 | 93 | F | r | −3, −6 | all | h1 | g2 | 0.5
4-2 | 500 | 250 | 2 | 2 | 2 | 1 | 93 | F | r | −3, −6 | all | h1 | g2 | 0.5
4-3 | 1000 | 500 | 2 | 2 | 2 | 1 | 93 | F | r | −3, −6 | all | h1 | g2 | 0.5
5 | 100 | 50 | 2 | 2 | 2 | 1 | 93 | F | r | r | all | h1 | g2 | 0.1
6a | 100 | 40 | 2 | 2 | 8 | 4 | 156 | F | r | −2, −4 | all | h1 | g2 | 0.5
6b | 100 | 24 | 3 | 2 | 8 | 8 | 308 | F | r | −2, −4, −6 | all | h1 | g2 | 0.1
Notes: Col. 1: task number. Col. 2: minimal sequence length p. Col. 3: minimal number of steps between the most recent relevant input information and the teacher signal. Col. 4: number of cell blocks b. Col. 5: block size s. Col. 6: number of input units in. Col. 7: number of output units out. Col. 8: number of weights w. Col. 9: c describes connectivity: F means "output layer receives connections from memory cells; memory cells and gate units receive connections from input units, memory cells, and gate units"; B means "each layer receives connections from all layers below." Col. 10: initial output gate bias ogb, where r stands for "randomly chosen from the interval [−0.1, 0.1]" and "no og" means "no output gate used." Col. 11: initial input gate bias igb (see Col. 10). Col. 12: which units have bias weights? b1 stands for "all hidden units," ga for "only gate units," and all for "all noninput units." Col. 13: the function h, where id is the identity function and h1 is a sigmoid in [−1, 1]. Col. 14: the function g, where g1 is a logistic sigmoid in [0, 1] and g2 a sigmoid in [−2, 2]. Col. 15: learning rate α.
5.7 Summary of Experimental Conditions. Tables 10 and 11 provide an overview of the most important LSTM parameters and architectural details for experiments 1 through 6. The conditions of the simple experiments 2a
Table 11: Summary of Experimental Conditions for LSTM, Part II.

(1) Task | (2) Select | (3) Interval | (4) Test Set Size | (5) Stopping Criterion | (6) Success
1 | t1 | [−0.2, 0.2] | 256 | training and test sets correctly predicted | see text
2a | t1 | [−0.2, 0.2] | no test set | after 5 million exemplars | ABS(0.25)
2b | t2 | [−0.2, 0.2] | 10,000 | after 5 million exemplars | ABS(0.25)
2c | t2 | [−0.2, 0.2] | 10,000 | after 5 million exemplars | ABS(0.2)
3a | t3 | [−0.1, 0.1] | 2560 | ST1 and ST2 (see text) | ABS(0.2)
3b | t3 | [−0.1, 0.1] | 2560 | ST1 and ST2 (see text) | ABS(0.2)
3c | t3 | [−0.1, 0.1] | 2560 | ST1 and ST2 (see text) | see text
4 | t3 | [−0.1, 0.1] | 2560 | ST3(0.01) | ABS(0.04)
5 | t3 | [−0.1, 0.1] | 2560 | see text | ABS(0.04)
6a | t3 | [−0.1, 0.1] | 2560 | ST3(0.1) | ABS(0.3)
6b | t3 | [−0.1, 0.1] | 2560 | ST3(0.1) | ABS(0.3)
Notes: Col. 1: task number. Col. 2: training exemplar selection, where t1 stands for "randomly chosen from the training set," t2 for "randomly chosen from two classes," and t3 for "randomly generated online." Col. 3: weight initialization interval. Col. 4: test set size. Col. 5: stopping criterion for training, where ST3(β) stands for "average training error below β and the 2000 most recent sequences were processed correctly." Col. 6: success (correct classification) criterion, where ABS(β) stands for "absolute error of all output units at sequence end is below β."
and 2b differ slightly from those of the other, more systematic experiments, for historical reasons.

6 Discussion

6.1 Limitations of LSTM.

• The particularly efficient truncated backpropagation version of the LSTM algorithm will not easily solve problems similar to strongly delayed XOR problems, where the goal is to compute the XOR of two widely separated inputs that previously occurred somewhere in a noisy sequence. The reason is that storing only one of the inputs will not help to reduce the expected error; the task is nondecomposable in the sense that it is impossible to reduce the error incrementally by first solving an easier subgoal. In theory, this limitation can be circumvented by using the full gradient (perhaps with additional conventional hidden units receiving input from the memory cells). But we do not recommend computing the full gradient for the following reasons: (1) it increases computational complexity, (2) constant error flow through CECs can be shown only for truncated LSTM, and (3) we actually did conduct a few experiments with nontruncated LSTM. There was no significant difference to truncated LSTM, exactly because outside the CECs, error flow tends to vanish quickly.
For the same reason, full BPTT does not outperform truncated BPTT.

• Each memory cell block needs two additional units (input and output gate). In comparison to standard recurrent nets, however, this does not increase the number of weights by more than a factor of 9: each conventional hidden unit is replaced by at most three units in the LSTM architecture, increasing the number of weights by a factor of $3^2$ in the fully connected case. Note, however, that our experiments use quite comparable weight numbers for the architectures of LSTM and competing approaches.

• Due to its constant error flow through CECs within memory cells, LSTM generally runs into problems similar to those of feedforward nets that see the entire input string at once. For instance, there are tasks that can be quickly solved by random weight guessing but not by the truncated LSTM algorithm with small weight initializations, such as the 500-step parity problem (see the introduction to section 5). Here, LSTM's problems are similar to the ones of a feedforward net with 500 inputs, trying to solve 500-bit parity. Indeed LSTM typically behaves much like a feedforward net trained by backpropagation that sees the entire input. But that is also precisely why it so clearly outperforms previous approaches on many nontrivial tasks with significant search spaces.

• LSTM does not have any problems with the notion of recency that go beyond those of other approaches. All gradient-based approaches, however, suffer from a practical inability to count discrete time steps precisely. If it makes a difference whether a certain signal occurred 99 or 100 steps ago, then an additional counting mechanism seems necessary. Easier tasks, however, such as one that requires distinguishing only between, say, 3 and 11 steps, do not pose any problems to LSTM. For instance, by generating an appropriate negative connection between memory cell output and input, LSTM can give more weight to recent inputs and learn decays where necessary.

6.2 Advantages of LSTM.

• The constant error backpropagation within memory cells results in LSTM's ability to bridge very long time lags in case of problems similar to those discussed above.

• For long-time-lag problems such as those discussed in this article, LSTM can handle noise, distributed representations, and continuous values. In contrast to finite state automata or hidden Markov models, LSTM does not require an a priori choice of a finite number of states. In principle, it can deal with unlimited state numbers.
• For problems discussed in this article, LSTM generalizes well, even if the positions of widely separated, relevant inputs in the input sequence do not matter. Unlike previous approaches, ours quickly learns to distinguish between two or more widely separated occurrences of a particular element in an input sequence, without depending on appropriate short-time-lag training exemplars.

• There appears to be no need for parameter fine-tuning. LSTM works well over a broad range of parameters such as learning rate, input gate bias, and output gate bias. For instance, to some readers the learning rates used in our experiments may seem large. However, a large learning rate pushes the output gates toward zero, thus automatically countermanding its own negative effects.

• The LSTM algorithm's update complexity per weight and time step is essentially that of BPTT, namely, O(1). This is excellent in comparison to other approaches such as RTRL. Unlike full BPTT, however, LSTM is local in both space and time.

7 Conclusion

Each memory cell's internal architecture guarantees constant error flow within its CEC, provided that truncated backpropagation cuts off error flow trying to leak out of memory cells. This represents the basis for bridging very long time lags. Two gate units learn to open and close access to error flow within each memory cell's CEC. The multiplicative input gate affords protection of the CEC from perturbation by irrelevant inputs. Similarly, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents.

To find out about LSTM's practical limitations, we intend to apply it to real-world data. Application areas will include time-series prediction, music composition, and speech processing. It will also be interesting to augment sequence chunkers (Schmidhuber, 1992b, 1993) with LSTM to combine the advantages of both.

Appendix

A.1 Algorithm Details. In what follows, the index k ranges over output units, i ranges over hidden units, $c_j$ stands for the jth memory cell block, $c_j^v$ denotes the vth unit of memory cell block $c_j$, u, l, m stand for arbitrary units, and t ranges over all time steps of a given input sequence.

The gate unit logistic sigmoid (with range [0, 1]) used in the experiments is

$$f(x) = \frac{1}{1 + \exp(-x)}. \tag{A.1}$$
The function h (with range [−1, 1]) used in the experiments is

$$h(x) = \frac{2}{1 + \exp(-x)} - 1. \tag{A.2}$$

The function g (with range [−2, 2]) used in the experiments is

$$g(x) = \frac{4}{1 + \exp(-x)} - 2. \tag{A.3}$$
A.1.1 Forward Pass. The net input and the activation of hidden unit i are

$$net_i(t) = \sum_u w_{iu}\, y^u(t-1), \qquad y^i(t) = f_i(net_i(t)). \tag{A.4}$$

The net input and the activation of $in_j$ are

$$net_{in_j}(t) = \sum_u w_{in_j u}\, y^u(t-1), \qquad y^{in_j}(t) = f_{in_j}(net_{in_j}(t)). \tag{A.5}$$

The net input and the activation of $out_j$ are

$$net_{out_j}(t) = \sum_u w_{out_j u}\, y^u(t-1), \qquad y^{out_j}(t) = f_{out_j}(net_{out_j}(t)). \tag{A.6}$$

The net input $net_{c_j^v}$, the internal state $s_{c_j^v}$, and the output activation $y^{c_j^v}$ of the vth memory cell of memory cell block $c_j$ are

$$net_{c_j^v}(t) = \sum_u w_{c_j^v u}\, y^u(t-1),$$
$$s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g\!\left(net_{c_j^v}(t)\right), \tag{A.7}$$
$$y^{c_j^v}(t) = y^{out_j}(t)\, h\!\left(s_{c_j^v}(t)\right).$$

The net input and the activation of output unit k are

$$net_k(t) = \sum_{u:\ u \text{ not a gate}} w_{ku}\, y^u(t-1), \qquad y^k(t) = f_k(net_k(t)).$$

The backward pass to be described later is based on the following truncated backpropagation formulas.
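Before turning to those formulas, here is a minimal Python sketch of the forward pass of equations A.1 through A.7 for a single memory cell block of size 1 (all variable and key names are our own; a full implementation would loop over blocks and cells):

```python
import math

def f(x):   # gate/unit logistic sigmoid, range [0, 1] (A.1)
    return 1.0 / (1.0 + math.exp(-x))

def h(x):   # output squashing function, range [-1, 1] (A.2)
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

def g(x):   # input squashing function, range [-2, 2] (A.3)
    return 4.0 / (1.0 + math.exp(-x)) - 2.0

def cell_forward(y_prev, s_prev, W):
    """One time step for one memory cell (equations A.4-A.7).
    y_prev: source activations y^u(t-1); s_prev: internal state s_c(t-1);
    W: dict of weight vectors, keyed by 'in', 'out', 'c'."""
    net_in = sum(w * y for w, y in zip(W["in"], y_prev))
    net_out = sum(w * y for w, y in zip(W["out"], y_prev))
    net_c = sum(w * y for w, y in zip(W["c"], y_prev))
    y_in, y_out = f(net_in), f(net_out)   # gate activations (A.5, A.6)
    s = s_prev + y_in * g(net_c)          # CEC accumulates; no decay (A.7)
    y_c = y_out * h(s)                    # gated cell output (A.7)
    return y_c, s, (net_in, net_out, net_c, y_in, y_out)
```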
A.1.2 Approximate Derivatives for Truncated Backpropagation. The truncated version (see section 4) only approximates the partial derivatives, which is reflected by the $\approx_{tr}$ signs in the notation below. It truncates error flow once it leaves memory cells or gate units. Truncation ensures that there are no loops across which an error that left some memory cell through its input or input gate can reenter the cell through its output or output gate. This in turn ensures constant error flow through the memory cell's CEC.

In the truncated backpropagation version, the following derivatives are replaced by zero:

$$\frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u, \qquad \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$

and

$$\frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u.$$

Therefore we get

$$\frac{\partial y^{in_j}(t)}{\partial y^u(t-1)} = f'_{in_j}(net_{in_j}(t)) \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$

$$\frac{\partial y^{out_j}(t)}{\partial y^u(t-1)} = f'_{out_j}(net_{out_j}(t)) \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u,$$

and

$$\frac{\partial y^{c_j}(t)}{\partial y^u(t-1)} = \frac{\partial y^{c_j}(t)}{\partial net_{out_j}(t)} \frac{\partial net_{out_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j}(t)}{\partial net_{in_j}(t)} \frac{\partial net_{in_j}(t)}{\partial y^u(t-1)} + \frac{\partial y^{c_j}(t)}{\partial net_{c_j}(t)} \frac{\partial net_{c_j}(t)}{\partial y^u(t-1)} \approx_{tr} 0 \quad \forall u.$$

This implies for all $w_{lm}$ not on connections to $c_j^v$, $in_j$, $out_j$ (that is, $l \notin \{c_j^v, in_j, out_j\}$):

$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \sum_u \frac{\partial y^{c_j^v}(t)}{\partial y^u(t-1)} \frac{\partial y^u(t-1)}{\partial w_{lm}} \approx_{tr} 0.$$

The truncated derivatives of output unit k are

$$\frac{\partial y^k(t)}{\partial w_{lm}} = f'_k(net_k(t)) \left( \sum_{u:\ u \text{ not a gate}} w_{ku} \frac{\partial y^u(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right)$$
$$\approx_{tr} f'_k(net_k(t)) \left( \sum_j \sum_{v=1}^{S_j} \delta_{c_j^v l}\, w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} + \sum_j \left( \delta_{in_j l} + \delta_{out_j l} \right) \sum_{v=1}^{S_j} w_{k c_j^v} \frac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} + \sum_{i:\ i \text{ hidden unit}} w_{ki} \frac{\partial y^i(t-1)}{\partial w_{lm}} + \delta_{kl}\, y^m(t-1) \right)$$
$$= f'_k(net_k(t)) \cdot \begin{cases} y^m(t-1) & l = k \\[2pt] w_{k c_j^v} \dfrac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = c_j^v \\[2pt] \displaystyle\sum_{v=1}^{S_j} w_{k c_j^v} \dfrac{\partial y^{c_j^v}(t-1)}{\partial w_{lm}} & l = in_j \text{ or } l = out_j \\[2pt] \displaystyle\sum_{i:\ i \text{ hidden unit}} w_{ki} \dfrac{\partial y^i(t-1)}{\partial w_{lm}} & \text{otherwise} \end{cases} \tag{A.8}$$

where $\delta$ is the Kronecker delta ($\delta_{ab} = 1$ if $a = b$ and 0 otherwise), and $S_j$ is the size of memory cell block $c_j$. The truncated derivatives of a hidden unit i that is not part of a memory cell are

$$\frac{\partial y^i(t)}{\partial w_{lm}} = f'_i(net_i(t)) \frac{\partial net_i(t)}{\partial w_{lm}} \approx_{tr} \delta_{li}\, f'_i(net_i(t))\, y^m(t-1). \tag{A.9}$$

(Here it would be possible to use the full gradient without affecting constant error flow through internal states of memory cells.) Cell block $c_j$'s truncated derivatives are

$$\frac{\partial y^{in_j}(t)}{\partial w_{lm}} = f'_{in_j}(net_{in_j}(t)) \frac{\partial net_{in_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, y^m(t-1), \tag{A.10}$$

$$\frac{\partial y^{out_j}(t)}{\partial w_{lm}} = f'_{out_j}(net_{out_j}(t)) \frac{\partial net_{out_j}(t)}{\partial w_{lm}} \approx_{tr} \delta_{out_j l}\, f'_{out_j}(net_{out_j}(t))\, y^m(t-1), \tag{A.11}$$

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right) + y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}}$$
$$\approx_{tr} \left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{in_j l}\, \frac{\partial y^{in_j}(t)}{\partial w_{lm}}\, g\!\left(net_{c_j^v}(t)\right) + \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) \frac{\partial net_{c_j^v}(t)}{\partial w_{lm}}$$
$$= \left( \delta_{in_j l} + \delta_{c_j^v l} \right) \frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}} + \delta_{in_j l}\, f'_{in_j}(net_{in_j}(t))\, g\!\left(net_{c_j^v}(t)\right) y^m(t-1) + \delta_{c_j^v l}\, y^{in_j}(t)\, g'\!\left(net_{c_j^v}(t)\right) y^m(t-1), \tag{A.12}$$

$$\frac{\partial y^{c_j^v}(t)}{\partial w_{lm}} = \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h\!\left(s_{c_j^v}(t)\right) + y^{out_j}(t)\, h'\!\left(s_{c_j^v}(t)\right) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}$$
$$\approx_{tr} \delta_{out_j l}\, \frac{\partial y^{out_j}(t)}{\partial w_{lm}}\, h\!\left(s_{c_j^v}(t)\right) + \left( \delta_{in_j l} + \delta_{c_j^v l} \right) y^{out_j}(t)\, h'\!\left(s_{c_j^v}(t)\right) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \tag{A.13}$$
To update the system efficiently at time t, the only (truncated) derivatives that need to be stored at time t − 1 are $\partial s_{c_j^v}(t-1) / \partial w_{lm}$, where $l = c_j^v$ or $l = in_j$.

A.1.3 Backward Pass. We will describe the backward pass only for the particularly efficient truncated gradient version of the LSTM algorithm. For simplicity we will use equal signs even where approximations are made according to the truncated backpropagation equations above. The squared error at time t is given by

$$E(t) = \sum_{k:\ k \text{ output unit}} \left( t^k(t) - y^k(t) \right)^2, \tag{A.14}$$

where $t^k(t)$ is output unit k's target at time t. Time t's contribution to $w_{lm}$'s gradient-based update with learning rate $\alpha$ is

$$\Delta w_{lm}(t) = -\alpha \frac{\partial E(t)}{\partial w_{lm}}. \tag{A.15}$$

We define some unit l's error at time step t by

$$e_l(t) := -\frac{\partial E(t)}{\partial net_l(t)}. \tag{A.16}$$

Using (almost) standard backpropagation, we first compute updates for weights to output units (l = k), weights to hidden units (l = i), and weights to output gates ($l = out_j$).
We obtain (compare formulas A.8, A.9, and A.11)

$$l = k \text{ (output units)}: \quad e_k(t) = f'_k(net_k(t)) \left( t^k(t) - y^k(t) \right), \tag{A.17}$$

$$l = i \text{ (hidden units)}: \quad e_i(t) = f'_i(net_i(t)) \sum_{k:\ k \text{ output unit}} w_{ki}\, e_k(t), \tag{A.18}$$

$$l = out_j \text{ (output gates)}: \quad e_{out_j}(t) = f'_{out_j}(net_{out_j}(t)) \sum_{v=1}^{S_j} h\!\left(s_{c_j^v}(t)\right) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t). \tag{A.19}$$
For all possible l, time t's contribution to $w_{lm}$'s update is

$$\Delta w_{lm}(t) = \alpha\, e_l(t)\, y^m(t-1). \tag{A.20}$$
The remaining updates for weights to input gates ($l = in_j$) and to cell units ($l = c_j^v$) are less conventional. We define the internal state $s_{c_j^v}$'s error:

$$e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} = f_{out_j}(net_{out_j}(t))\, h'\!\left(s_{c_j^v}(t)\right) \sum_{k:\ k \text{ output unit}} w_{k c_j^v}\, e_k(t). \tag{A.21}$$
We obtain for $l = in_j$ or $l = c_j^v$, $v = 1, \dots, S_j$,

$$-\frac{\partial E(t)}{\partial w_{lm}} = \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \tag{A.22}$$
The derivatives of the internal states with respect to weights and the corresponding weight updates are as follows (compare expression A.12). For $l = in_j$ (input gates):

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{in_j m}} + g\!\left(net_{c_j^v}(t)\right) f'_{in_j}(net_{in_j}(t))\, y^m(t-1); \tag{A.23}$$

therefore, time t's contribution to $w_{in_j m}$'s update is (compare expression A.8)

$$\Delta w_{in_j m}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{in_j m}}. \tag{A.24}$$
Similarly, we get (compare expression A.12), for $l = c_j^v$ (memory cells):

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}} + g'\!\left(net_{c_j^v}(t)\right) f_{in_j}(net_{in_j}(t))\, y^m(t-1); \tag{A.25}$$

therefore, time t's contribution to $w_{c_j^v m}$'s update is (compare expression A.8)

$$\Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t) \frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}. \tag{A.26}$$
All we need to implement for the backward pass are equations A.17 through A.21 and A.23 through A.26. Each weight's total update is the sum of the contributions of all time steps.

A.1.4 Computational Complexity. LSTM's update complexity per time step is

$$O(KH + KCS + HI + CSI) = O(W), \tag{A.27}$$
where K is the number of output units, C is the number of memory cell blocks, S > 0 is the size of the memory cell blocks, H is the number of hidden units, I is the (maximal) number of units forward connected to memory cells, gate units, and hidden units, and W = KH + KCS + CSI + 2CI + HI = O(KH + KCS + CSI + HI) is the number of weights. Expression A.27 is obtained by considering all computations of the backward pass: equation A.17 needs K steps; A.18 needs KH steps; A.19 needs KSC steps; A.20 needs K(H + C) steps for output units, HI steps for hidden units, and CI steps for output gates; A.21 needs KCS steps; A.23 needs CSI steps; A.24 needs CSI steps; A.25 needs CSI steps; A.26 needs CSI steps. The total is K + 2KH + KC + 2KSC + HI + CI + 4CSI steps, or O(KH + KSC + HI + CSI) steps. We conclude that the LSTM algorithm's update complexity per time step is just like BPTT's for a fully recurrent net. At a given time step, only the 2CSI most recent $\partial s_{c_j^v} / \partial w_{lm}$ values from equations A.23 and A.25 need to be stored. Hence LSTM's storage complexity also is O(W); it does not depend on the input sequence length.

A.2 Error Flow. We compute how much an error signal is scaled while flowing back through a memory cell for q time steps. As a by-product, this analysis reconfirms that the error flow within a memory cell's CEC is indeed constant, provided that truncated backpropagation cuts off error flow trying to leave memory cells (see also section 3.2). The analysis also highlights a potential for undesirable long-term drifts of $s_{c_j}$, as well as the beneficial, countermanding influence of negatively biased input gates.
Using the truncated backpropagation learning rule, we obtain

$$\frac{\partial s_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)} = 1 + \frac{\partial y^{in_j}(t-k)}{\partial s_{c_j}(t-k-1)}\, g\!\left(net_{c_j}(t-k)\right) + y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \frac{\partial net_{c_j}(t-k)}{\partial s_{c_j}(t-k-1)}$$
$$= 1 + \left[ \sum_u \frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} \right] g\!\left(net_{c_j}(t-k)\right) + y^{in_j}(t-k)\, g'\!\left(net_{c_j}(t-k)\right) \left[ \sum_u \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)} \frac{\partial y^u(t-k-1)}{\partial s_{c_j}(t-k-1)} \right]$$
$$\approx_{tr} 1. \tag{A.28}$$
The $\approx_{tr}$ sign indicates equality due to the fact that truncated backpropagation replaces the following derivatives by zero:

$$\frac{\partial y^{in_j}(t-k)}{\partial y^u(t-k-1)} \quad \forall u \qquad \text{and} \qquad \frac{\partial net_{c_j}(t-k)}{\partial y^u(t-k-1)} \quad \forall u.$$

In what follows, an error $\vartheta_j(t)$ starts flowing back at $c_j$'s output. We redefine

$$\vartheta_j(t) := \sum_i w_{i c_j}\, \vartheta_i(t+1). \tag{A.29}$$
Following the definitions and conventions of section 3.1, we compute error flow for the truncated backpropagation learning rule. The error occurring at the output gate is

$$\vartheta_{out_j}(t) \approx_{tr} \frac{\partial y^{out_j}(t)}{\partial net_{out_j}(t)} \frac{\partial y^{c_j}(t)}{\partial y^{out_j}(t)}\, \vartheta_j(t). \tag{A.30}$$
The error occurring at the internal state is

$$\vartheta_{s_{c_j}}(t) = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)}\, \vartheta_{s_{c_j}}(t+1) + \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)}\, \vartheta_j(t)\,. \qquad (A.31)$$

Since we use truncated backpropagation, we have

$$\vartheta_j(t) = \sum_{i:\, i \text{ no gate and no memory cell}} w_{i c_j}\, \vartheta_i(t+1)\,;$$
therefore we get

$$\frac{\partial \vartheta_j(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \sum_i w_{i c_j}\, \frac{\partial \vartheta_i(t+1)}{\partial \vartheta_{s_{c_j}}(t+1)} \approx_{tr} 0\,. \qquad (A.32)$$
Equations A.31 and A.32 imply constant error flow through internal states of memory cells:

$$\frac{\partial \vartheta_{s_{c_j}}(t)}{\partial \vartheta_{s_{c_j}}(t+1)} = \frac{\partial s_{c_j}(t+1)}{\partial s_{c_j}(t)} \approx_{tr} 1\,. \qquad (A.33)$$
The error occurring at the memory cell input is

$$\vartheta_{c_j}(t) = \frac{\partial g(net_{c_j}(t))}{\partial net_{c_j}(t)}\, \frac{\partial s_{c_j}(t)}{\partial g(net_{c_j}(t))}\, \vartheta_{s_{c_j}}(t)\,. \qquad (A.34)$$
The error occurring at the input gate is

$$\vartheta_{in_j}(t) \approx_{tr} \frac{\partial y^{in_j}(t)}{\partial net_{in_j}(t)}\, \frac{\partial s_{c_j}(t)}{\partial y^{in_j}(t)}\, \vartheta_{s_{c_j}}(t)\,. \qquad (A.35)$$
A.2.1 No External Error Flow. Errors are propagated back from units l to unit v along outgoing connections with weights $w_{lv}$. This "external error" (note that for conventional units there is nothing but external error) at time t is

$$\vartheta_v^e(t) = \frac{\partial y^v(t)}{\partial net_v(t)} \sum_l \frac{\partial net_l(t+1)}{\partial y^v(t)}\, \vartheta_l(t+1)\,. \qquad (A.36)$$
We obtain

$$\frac{\partial \vartheta_v^e(t-1)}{\partial \vartheta_j(t)} = \frac{\partial y^v(t-1)}{\partial net_v(t-1)} \left( \frac{\partial \vartheta_{out_j}(t)}{\partial \vartheta_j(t)}\, \frac{\partial net_{out_j}(t)}{\partial y^v(t-1)} + \frac{\partial \vartheta_{c_j}(t)}{\partial \vartheta_j(t)}\, \frac{\partial net_{c_j}(t)}{\partial y^v(t-1)} + \frac{\partial \vartheta_{in_j}(t)}{\partial \vartheta_j(t)}\, \frac{\partial net_{in_j}(t)}{\partial y^v(t-1)} \right) \approx_{tr} 0\,. \qquad (A.37)$$
We observe that the error $\vartheta_j$ arriving at the memory cell output is not backpropagated to units v by external connections to $in_j$, $out_j$, $c_j$.

A.2.2 Error Flow Within Memory Cells. We now focus on the error backflow within a memory cell's CEC. This is actually the only type of error flow that can bridge several time steps. Suppose error $\vartheta_j(t)$ arrives at $c_j$'s output
at time t and is propagated back for q steps until it reaches $in_j$ or the memory cell input $g(net_{c_j})$. It is scaled by a factor of $\partial \vartheta_v(t-q) / \partial \vartheta_j(t)$, where $v = in_j, c_j$. We first compute

$$\frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)} \approx_{tr}
\begin{cases}
\dfrac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} & q = 0 \\[2ex]
\dfrac{\partial s_{c_j}(t-q+1)}{\partial s_{c_j}(t-q)}\, \dfrac{\partial \vartheta_{s_{c_j}}(t-q+1)}{\partial \vartheta_j(t)} & q > 0
\end{cases}\,. \qquad (A.38)$$

Expanding equation A.38, we obtain

$$\begin{aligned}
\frac{\partial \vartheta_v(t-q)}{\partial \vartheta_j(t)}
&\approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)}\, \frac{\partial \vartheta_{s_{c_j}}(t-q)}{\partial \vartheta_j(t)}
\approx_{tr} \frac{\partial \vartheta_v(t-q)}{\partial \vartheta_{s_{c_j}}(t-q)} \left( \prod_{m=q}^{1} \frac{\partial s_{c_j}(t-m+1)}{\partial s_{c_j}(t-m)} \right) \frac{\partial y^{c_j}(t)}{\partial s_{c_j}(t)} \\
&\approx_{tr} y^{out_j}(t)\, h'(s_{c_j}(t))
\begin{cases}
g'\big(net_{c_j}(t-q)\big)\, y^{in_j}(t-q) & v = c_j \\
g\big(net_{c_j}(t-q)\big)\, f'_{in_j}\big(net_{in_j}(t-q)\big) & v = in_j
\end{cases}\,. \qquad (A.39)
\end{aligned}$$

Consider the factors in the previous equation's last expression. Obviously, error flow is scaled only at times t (when it enters the cell) and t − q (when it leaves the cell), but not in between (constant error flow through the CEC); a toy numerical illustration of this q-independence follows the list below. We observe:

1. The output gate's effect: $y^{out_j}(t)$ scales down those errors that can be reduced early during training without using the memory cell. It also scales down those errors resulting from using (activating/deactivating) the memory cell at later training stages. Without the output gate, the memory cell might, for instance, suddenly start causing avoidable errors in situations that already seemed under control (because it was easy to reduce the corresponding errors without memory cells). See "Output Weight Conflict" in section 3 and "Abuse Problem and Solution" (section 4.7).

2. If there are large positive or negative $s_{c_j}(t)$ values (because $s_{c_j}$ has drifted since time step t − q), then $h'(s_{c_j}(t))$ may be small (assuming that h is a logistic sigmoid). See section 4. Drifts of the memory cell's internal state $s_{c_j}$ can be countermanded by negatively biasing the input gate $in_j$ (see section 4 and the next point). Recall from section 4 that the precise bias value does not matter much.

3. $y^{in_j}(t-q)$ and $f'_{in_j}(net_{in_j}(t-q))$ are small if the input gate is negatively biased (assume $f_{in_j}$ is a logistic sigmoid). However, the potential significance of this is negligible compared to the potential significance of drifts of the internal state $s_{c_j}$.
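The following toy sketch (ours, not from the paper) evaluates the $v = c_j$ scaling factor of equation A.39 for several lag lengths q, taking g, h, and $f_{in_j}$ to be plain logistic sigmoids as the observations above assume; since the activations at times t and t − q are held fixed, the printed factor is identical for every q:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cec_scaling(q, y_out_t, s_t, net_c_tq, net_in_tq):
    """Error scaling factor of equation A.39 for v = c_j. Note that q
    never enters the computation: the CEC passes the error unchanged."""
    h_prime = sigmoid(s_t) * (1.0 - sigmoid(s_t))            # h'(s(t))
    g_prime = sigmoid(net_c_tq) * (1.0 - sigmoid(net_c_tq))  # g'(net(t-q))
    y_in_tq = sigmoid(net_in_tq)                             # gate at t-q
    return y_out_t * h_prime * g_prime * y_in_tq

for q in (1, 10, 1000):
    print(q, cec_scaling(q, y_out_t=0.5, s_t=0.3,
                         net_c_tq=0.2, net_in_tq=-1.0))
```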
Some of the factors above may scale down LSTM's overall error flow, but not in a manner that depends on the length of the time lag. The flow will still be much more effective than an exponentially (of order q) decaying flow without memory cells.

Acknowledgments

Thanks to Mike Mozer, Wilfried Brauer, Nic Schraudolph, and several anonymous referees for valuable comments and suggestions that helped to improve a previous version of this article (Hochreiter and Schmidhuber, 1995). This work was supported by DFG grant SCHM 942/3-1 from Deutsche Forschungsgemeinschaft.

References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In IEEE 1st International Conference on Neural Networks, San Diego (Vol. 2, pp. 609–618).

Baldi, P., & Pineda, F. (1991). Contrastive learning and neural oscillations. Neural Computation, 3, 526–545.

Bengio, Y., & Frasconi, P. (1994). Credit assignment through time: Alternatives to backpropagation. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6 (pp. 75–82). San Mateo, CA: Morgan Kaufmann.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1, 372–381.

de Vries, B., & Principe, J. C. (1991). A theory for neural networks with time delays. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 162–168). San Mateo, CA: Morgan Kaufmann.

Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Proceedings of 1992 IEEE International Symposium on Circuits and Systems (pp. 2777–2780).

Doya, K., & Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time backpropagation learning. Neural Networks, 2, 375–385.

Elman, J. L. (1988). Finding structure in time (Tech. Rep. No. CRL 8801). San Diego: Center for Research in Language, University of California, San Diego.

Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 190–196). San Mateo, CA: Morgan Kaufmann.
Hochreiter, J. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. See http://www7.informatik.tu-muenchen.de/˜hochreit.

Hochreiter, S., & Schmidhuber, J. (1995). Long short-term memory (Tech. Rep. No. FKI-207-95). Fakultät für Informatik, Technische Universität München.

Hochreiter, S., & Schmidhuber, J. (1996). Bridging long time lags by weight guessing and "long short-term memory." In F. L. Silva, J. C. Principe, & L. B. Almeida (Eds.), Spatiotemporal models in biological and artificial systems (pp. 65–72). Amsterdam: IOS Press.

Hochreiter, S., & Schmidhuber, J. (1997). LSTM can solve hard long time lag problems. In Advances in neural information processing systems 9. Cambridge, MA: MIT Press.

Lang, K., Waibel, A., & Hinton, G. E. (1990). A time-delay neural network architecture for isolated word recognition. Neural Networks, 3, 23–43.

Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7, 1329–1338.

Miller, C. B., & Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 849–872.

Mozer, M. C. (1989). A focused back-propagation algorithm for temporal sequence recognition. Complex Systems, 3, 349–381.

Mozer, M. C. (1992). Induction of multiscale temporal structure. In J. E. Moody, S. J. Hanson, & R. P. Lippman (Eds.), Advances in neural information processing systems 4 (pp. 275–282). San Mateo, CA: Morgan Kaufmann.

Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. Neural Computation, 1(2), 263–269.

Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.

Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59(19), 2229–2232.

Pineda, F. J. (1988). Dynamics and architecture for neural computation. Journal of Complexity, 4, 216–245.

Plate, T. A. (1993). Holographic recurrent networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 34–41). San Mateo, CA: Morgan Kaufmann.

Pollack, J. B. (1991). Language induction by phase transition in dynamical recognizers. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 619–626). San Mateo, CA: Morgan Kaufmann.

Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.

Ring, M. B. (1993). Learning sequential tasks by incrementally adding higher orders. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 115–122). San Mateo, CA: Morgan Kaufmann.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. No. CUED/F-INFENG/TR.1). Cambridge: Cambridge University Engineering Department.

Schmidhuber, J. (1989). A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4), 403–412.

Schmidhuber, J. (1992a). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243–248.

Schmidhuber, J. (1992b). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2), 234–242.

Schmidhuber, J. (1992c). Learning unambiguous reduced sequence descriptions. In J. E. Moody, S. J. Hanson, & R. P. Lippman (Eds.), Advances in neural information processing systems 4 (pp. 291–298). San Mateo, CA: Morgan Kaufmann.

Schmidhuber, J. (1993). Netzwerkarchitekturen, Zielfunktionen und Kettenregel. Habilitationsschrift, Institut für Informatik, Technische Universität München.

Schmidhuber, J., & Hochreiter, S. (1996). Guessing can outperform many long time lag algorithms (Tech. Rep. No. IDSIA-19-96). Lugano, Switzerland: Istituto Dalle Molle di Studi sull'Intelligenza Artificiale.

Silva, G. X., Amaral, J. D., Langlois, T., & Almeida, L. B. (1996). Faster training of recurrent networks. In F. L. Silva, J. C. Principe, & L. B. Almeida (Eds.), Spatiotemporal models in biological and artificial systems (pp. 168–175). Amsterdam: IOS Press.

Smith, A. W., & Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2), 125–131.

Sun, G., Chen, H., & Lee, Y. (1993). Time warping invariant neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems 5 (pp. 180–187). San Mateo, CA: Morgan Kaufmann.

Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4, 406–414.

Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–356.

Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks (Tech. Rep. No. NU-CCS-89-27). Boston: Northeastern University, College of Computer Science.

Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 4, 491–501.

Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, architectures and applications. Hillsdale, NJ: Erlbaum.
Received August 28, 1995; accepted February 24, 1997.
Communicated by Robert Jacobs
Factor Analysis Using Delta-Rule Wake-Sleep Learning

Radford M. Neal
Department of Statistics and Department of Computer Science, University of Toronto, Toronto M5S 1A1, Canada
Peter Dayan Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, 02139 U.S.A.
We describe a linear network that models correlations between real-valued visible variables using one or more real-valued hidden variables—a factor analysis model. This model can be seen as a linear version of the Helmholtz machine, and its parameters can be learned using the wake-sleep method, in which learning of the primary generative model is assisted by a recognition model, whose role is to fill in the values of hidden variables based on the values of visible variables. The generative and recognition models are jointly learned in wake and sleep phases, using just the delta rule. This learning procedure is comparable in simplicity to Hebbian learning, which produces a somewhat different representation of correlations in terms of principal components. We argue that the simplicity of wake-sleep learning makes factor analysis a plausible alternative to Hebbian learning as a model of activity-dependent cortical plasticity.
1 Introduction

Statistical structure in a collection of inputs can be found using purely local Hebbian learning rules (Hebb, 1949), which capture second-order aspects of the data in terms of principal components (Linsker, 1988; Oja, 1989). This form of statistical analysis has therefore been used to provide a computational account of activity-dependent plasticity in the vertebrate brain (e.g., von der Malsburg, 1973; Linsker, 1986; Miller, Keller, & Stryker, 1989). There are reasons, however, that principal component analysis may not be an adequate model for the generation of cortical receptive fields (e.g., Olshausen & Field, 1996). Furthermore, a Hebbian mechanism for performing principal component analysis would accord no role to the top-down (feedback) connections that always accompany bottom-up connections (Felleman & Van Essen, 1991). Hebbian learning must also be augmented by extra mechanisms in order to extract more than just the first principal
component and in order to prevent the synaptic weights from growing without bound.

In this article, we present the statistical technique of factor analysis as an alternative to principal component analysis and show how factor analysis models can be learned using an algorithm whose demands on synaptic plasticity are as local as those of the Hebb rule. Our approach follows the suggestion of Hinton and Zemel (1994) (see also Grenander, 1976–1981; Mumford, 1994; Dayan, Hinton, Neal, & Zemel, 1995; Olshausen & Field, 1996) that the top-down connections in the cortex might be constructing a hierarchical, probabilistic "generative" model, in which dependencies between the activities of neurons reporting sensory data are seen as being due to the presence in the world of hidden or latent "factors." The role of the bottom-up connections is to implement an inverse "recognition" model, which takes low-level sensory input and instantiates in the activities of neurons in higher layers a set of likely values for the hidden factors, which are capable of explaining this input. These higher-level activities are meant to correspond to significant features of the external world, and thus to form an appropriate basis for behavior. We refer to such a combination of generative and recognition models as a Helmholtz machine.

The simplest form of Helmholtz machine, with just two layers, linear units, and gaussian noise, is equivalent to the factor analysis model, a statistical method that is used widely in psychology and the social sciences as a way of exploring whether observed patterns in data might be explainable in terms of a small number of unobserved factors. Everitt (1984) gives a good introduction to factor analysis and to other latent variable models. Even though factor analysis involves only linear operations and gaussian noise, learning a factor analysis model is not computationally straightforward; practical algorithms for maximum likelihood factor analysis (Jöreskog, 1967, 1969, 1977) took many years to develop. Existing algorithms change the values of parameters based on complex and nonlocal calculations.

A general approach to learning in Helmholtz machines that is attractive for its simplicity is the wake-sleep algorithm of Hinton, Dayan, Frey, and Neal (1995). We show empirically in this article that maximum likelihood factor analysis models can be learned by the wake-sleep method, which uses just the purely local delta rule in both of its two separate learning phases. The wake-sleep algorithm has previously been applied to the more difficult task of learning nonlinear models with binary latent variables, with mixed results. We have found that good results are obtained much more consistently when the wake-sleep algorithm is used to learn factor analysis models, perhaps because settings of the recognition model parameters that invert the generative model always exist. In the factor analysis models we look at in this article, the factors are a priori independent, and this simplicity prevents them from reproducing interesting aspects of cortical structure. However, these results contribute to
the understanding of wake-sleep learning. They also point to possibilities for modeling cortical plasticity, since the wake-sleep approach avoids many of the criticisms that can be leveled at Hebbian learning.

2 Factor Analysis

Factor analysis with a single hidden factor is based on a generative model for the distribution of a vector of real-valued visible inputs, x, given by

$$x = \mu + g y + \epsilon. \qquad (2.1)$$
Here, y is the single real-valued factor and is assumed to have a gaussian distribution with mean zero and variance one. The vector of "factor loadings," g, which we refer to as the generative weights, expresses the way the visible variables are related to the hidden factor. The vector of overall means, µ, will, for simplicity of presentation, be taken to be zero in this article, unless otherwise stated. Finally, the noise, ε, is assumed to be gaussian, with a diagonal covariance matrix, Ψ, which we will write as

$$\Psi = \begin{bmatrix} \tau_1^2 & 0 & \cdots & 0 \\ 0 & \tau_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tau_n^2 \end{bmatrix}. \qquad (2.2)$$
The $\tau_j^2$ are sometimes called the "uniquenesses," as they represent the portion of the variance in each visible variable that is unique to it rather than being explained by the common factor. We will refer to them as the generative variance parameters (not to be confused with the variance of the hidden factor in the generative model, which is fixed at one). The model parameters, µ, g, and Ψ, define a joint gaussian distribution for both the hidden factor, y, and the visible inputs, x. In a Helmholtz machine, this generative model is accompanied by a recognition model, which represents the conditional distribution for the hidden factor, y, given a particular input vector, x. This recognition model also has a simple form, which, for µ = 0, is

$$y = r^T x + \nu, \qquad (2.3)$$

where r is the vector of recognition weights and ν has a gaussian distribution with mean zero and variance $\sigma^2$. It is straightforward to show that the correct recognition model has $r = [g g^T + \Psi]^{-1} g$ and $\sigma^2 = 1 - g^T [g g^T + \Psi]^{-1} g$, but we presume that directly obtaining the recognition model's parameters in this way is not plausible in a neurobiological model.
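The closed-form recognition model above is easy to verify numerically. The following sketch (ours, not from the paper) draws data from a random single-factor generative model and checks that the residual y − rᵀx has variance σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
g = rng.standard_normal(p)               # generative weights
Psi = np.diag(rng.exponential(1.0, p))   # diagonal noise covariance

C = np.outer(g, g) + Psi                 # Cov(x) = g g^T + Psi
r = np.linalg.solve(C, g)                # r = [g g^T + Psi]^{-1} g
sigma2 = 1.0 - g @ r                     # recognition variance

# Monte Carlo check: Var(y - r^T x) should equal sigma2.
n = 200_000
y = rng.standard_normal(n)
x = np.outer(y, g) + rng.standard_normal((n, p)) * np.sqrt(np.diag(Psi))
print(np.var(y - x @ r), sigma2)         # the two values agree closely
```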
The factor analysis model can be extended to include several hidden factors, which are usually assumed to have independent gaussian distributions with mean zero and variance one (though we discuss other possibilities in section 5.2). These factors jointly produce a distribution for the visible variables, as follows:

$$x = \mu + G y + \epsilon. \qquad (2.4)$$
Here, y is the vector of hidden factor values, G is a matrix of generative weights (factor loadings), and ε is again a noise vector with diagonal covariance. A linear recognition model can again be found to represent the conditional distribution of the hidden factors for given values of the visible variables. Note that there is redundancy in the definition of the multiple-factor model, caused by the symmetry of the distribution of y in the generative model. A model with generative weights $G' = G U^T$, where U is any unitary matrix (for which $U^T U = I$), will produce the same distribution for x, with the hidden factors $y' = U y$ again being independent, with mean zero and variance one. The presence of multiple solutions is sometimes seen as a problem with factor analysis in statistical applications, since it makes interpretation more difficult, but this is hardly an issue in neurobiological modeling, since easy interpretability of neural activities is not something that we are at liberty to require.

Although there are some special circumstances in which factor analysis is equivalent to principal component analysis, the techniques are in general quite different (Jolliffe, 1986). Loosely speaking, principal component analysis pays attention to both variance and covariance, whereas factor analysis looks only at covariance. In particular, if one of the components of x is corrupted by a large amount of noise, the principal eigenvector of the covariance matrix of the inputs will have a substantial component in the direction of that input. Hebbian learning will therefore result in the output, y, being dominated by this noise. In contrast, factor analysis uses $\epsilon_j$ to model any noise that affects only component j. A large amount of noise simply results in the corresponding $\tau_j^2$ being large, with no effect on the output y. Principal component analysis is also unaffected by a rotation of the coordinate system of the input vectors, whereas factor analysis privileges the particular coordinate system in which the data are presented, because it is assumed in equation 2.2 that it is in this coordinate system that the components of the noise are independent.

3 Learning Factor Analysis Models with the Wake-Sleep Algorithm

In this section, we describe wake-sleep algorithms for learning factor analysis models, starting with the simplest such model, having a single hidden factor. We also provide an intuitive justification for believing that these algorithms might work, based on their resemblance to the Expectation-Maximization (EM) algorithm. We have not found a complete theoretical
proof that wake-sleep learning for factor analysis works, but we present empirical evidence that it usually does in section 4.

3.1 Maximum Likelihood and the EM Algorithm. In maximum likelihood learning, the parameters of the model are chosen to maximize the probability density assigned by the model to the data that were observed (the "likelihood"). For a factor analysis model with a single factor, the likelihood, L, based on the observed data in n independent cases, $x^{(1)}, \ldots, x^{(n)}$, is obtained by integrating over the possible values that the hidden factors, $y^{(c)}$, might take on for each case:

$$L(g, \Psi) = \prod_{c=1}^{n} p(x^{(c)} \mid g, \Psi) = \prod_{c=1}^{n} \int p(x^{(c)} \mid y^{(c)}, g, \Psi)\, p(y^{(c)})\, dy^{(c)} \qquad (3.1)$$
where g and Ψ are the model parameters, as described previously, and p(·) is used to write probability densities and conditional probability densities. The prior distribution for the hidden factor, $p(y^{(c)})$, is gaussian with mean zero and variance one. The conditional distribution for $x^{(c)}$ given $y^{(c)}$ is gaussian with mean $\mu + g y^{(c)}$ and covariance Ψ, as implied by equation 2.1.

Wake-sleep learning can be viewed as an approximate implementation of the EM algorithm, which Dempster, Laird, and Rubin (1977) present as a general approach to maximum likelihood estimation when some variables (in our case, the values of the hidden factors) are unobserved. Applications of EM to factor analysis are discussed by Dempster et al. (1977) and by Rubin and Thayer (1982), who find that EM produces results almost identical to those of Jöreskog's (1969) method, including getting stuck in essentially the same local maxima given the same starting values for the parameters.¹

EM is an iterative method, in which each iteration consists of two steps. In the E-step, one finds the conditional distribution for the unobserved variables given the observed data, based on the current estimates for the model parameters. When the cases are independent, this conditional distribution factors into distributions for the unobserved variables in each case. For the single-factor model, the distribution for the hidden factor, $y^{(c)}$, in a case with observed data $x^{(c)}$, is

$$p(y^{(c)} \mid x^{(c)}, g, \Psi) = \frac{p(y^{(c)}, x^{(c)} \mid g, \Psi)}{\int p(x^{(c)} \mid y, g, \Psi)\, p(y)\, dy}. \qquad (3.2)$$
1 It may seem surprising that maximum likelihood factor analysis can be troubled by local maxima, in view of the simple linear nature of the model, but this is in fact quite possible. For example, a local maximum can arise if a single-factor model is applied to data consisting of two pairs of correlated variables. The single factor can capture only one of these correlations. If the initial weights are appropriate for modeling the weaker of the two correlations, learning may never find the global maximum in which the single factor is used instead to model the stronger correlation.
In the M-step, one finds new parameter values, g′ and Ψ′, that maximize (or at least increase) the expected log likelihood of the complete data, with the unobserved variables filled in according to the conditional distribution found in the E-step:

$$\sum_{c=1}^{n} \int p(y^{(c)} \mid x^{(c)}, g, \Psi) \log\left[ p(x^{(c)} \mid y^{(c)}, g', \Psi')\, p(y^{(c)}) \right] dy^{(c)}. \qquad (3.3)$$
For factor analysis, the conditional distribution for $y^{(c)}$ in equation 3.2 is gaussian, and the quantity whose expectation with respect to this distribution is required above is quadratic in $y^{(c)}$. This expectation is therefore easily found on a computer. However, the matrix calculations involved require nonlocal operations.

3.2 The Wake-Sleep Approach. To obtain a local learning procedure, we first eliminate the explicit maximization in the M-step of the quantity defined in equation 3.3 in terms of an expectation. We replace this by a gradient-following learning procedure in which the expectation is implicitly found by combining many updates based on values for the $y^{(c)}$ that are stochastically generated from the appropriate conditional distributions. We furthermore avoid the direct computation of these conditional distributions in the E-step by learning to produce them using a recognition model, trained in tandem with the generative model.

This approach results in the wake-sleep learning procedure, with the wake phase playing the role of the M-step in EM and the sleep phase playing the role of the E-step. The names for these phases are metaphorical; we are not proposing neurobiological correlates for the wake and sleep phases. Learning consists of interleaved iterations of these two phases, which in the context of factor analysis operate as follows:

Wake phase: From observed values for the visible variables, x, randomly fill in values for the hidden factors, y, using the conditional distribution defined by the current recognition model. Update the parameters of the generative model to make this filled-in case more likely.

Sleep phase: Without reference to any observation, randomly choose values for the hidden factors, y, from their fixed generative distribution, and then randomly choose "fantasy" values for the visible variables, x, from their conditional distribution given y, as defined by the current parameters of the generative model. Update the parameters of the recognition model to make this fantasy case more likely.

The wake phase of learning will correctly implement the M-step of EM (in a stochastic sense), provided that the recognition model produces the correct conditional distribution for the hidden factors. The aim of sleep phase learning is to improve the recognition model's ability to produce this
conditional distribution, which would be computed in the E-step of EM. If values for the recognition model parameters exist that reproduce this conditional distribution exactly and if learning of the recognition model proceeds at a much faster rate than changes to the generative model, so that this correct distribution can continually be tracked, then the sleep phase will effectively implement the E-step of EM, and the wake-sleep algorithm as a whole will be guaranteed to find a (local) maximum of the likelihood (in the limit of small learning rates, so that stochastic variation is averaged out).

These conditions for wake-sleep learning to mimic EM are too stringent for actual applications, however. At a minimum, we would like wake-sleep learning to work for a wide range of relative learning rates in the two phases, not just in the limit as the ratio of the generative learning rate to the recognition learning rate goes to zero. We would also like wake-sleep learning to do something sensible even when it is not possible for the recognition model to invert the generative model perfectly (as is typical in applications other than factor analysis), though one would not usually expect the method to produce exact maximum likelihood estimates in such a case.

We could guarantee that wake-sleep learning behaves well if we could find a cost function that is decreased (on average) by both the wake phase and the sleep phase updates. The cost would then be a Lyapunov function for learning, providing a guarantee of stability and allowing something to be said about the stable states. Unfortunately, no such cost function has been discovered, for either wake-sleep learning in general or wake-sleep learning applied to factor analysis. The wake and sleep phases each separately reduce a sensible cost function (for each phase, a Kullback-Leibler divergence), but these two cost functions are not compatible with a single global cost function. An algorithm that correctly performs stochastic gradient descent in the recognition parameters of a Helmholtz machine using an appropriate global cost function does exist (Dayan & Hinton, 1996), but it involves reinforcement learning methods for which convergence is usually extremely slow. We have obtained some partial theoretical results concerning wake-sleep learning for factor analysis, which show that the maximum likelihood solutions are second-order Lyapunov stable and that updates of the weights (but not variances) for a single-factor model decrease an appropriate cost function. However, the primary reason for thinking that the wake-sleep algorithm generally works well for factor analysis is not these weak theoretical results but rather the empirical results in section 4. Before presenting these, we discuss in more detail how the general wake-sleep scheme is applied to factor analysis, first for a single-factor model and then for models with multiple factors.

3.3 Wake-Sleep Learning for a Single-Factor Model. A Helmholtz machine that implements factor analysis with a single hidden factor is shown
Figure 1: The one-factor linear Helmholtz machine. Connections of the generative model are shown using solid lines, those of the recognition model using dashed lines. The weights for these connections are given by the gj and the rj .
in Figure 1. The connections for the linear network that implements the generative model of equation 2.1 are shown in the figure using solid lines. Generation of data using this network starts with the selection of a random value for the hidden factor, y, from a gaussian distribution with mean zero and variance one. The value for the jth visible variable, xj , is then found by multiplying y by a connection weight, gj , and adding random noise with variance τj2 . (A bias value, µj , could be added as well, to produce nonzero means for the visible variables.) The recognition model’s connections are shown in Figure 1 using dashed lines. These connections implement equation 2.3. When presented with values for the visible input variables, xj , the recognition network produces a value for the hidden factor, y, by forming a weighted sum of the inputs, with the weight for the jth input being rj , and then adding gaussian noise with mean zero and variance σ 2 . (If the inputs have nonzero means, the recognition model would include a bias for the hidden factor as well.) The parameters of the generative model are learned in the wake phase, as follows. Values for the visible variables, x(c) , are obtained from the external world; they are set to a training case drawn from the distribution that we wish to model. The current version of the recognition network is then used to stochastically fill in a corresponding value for the hidden factor, y(c) , using equation 2.3. The generative weights are then updated using the delta rule,
as follows:

$$g_j' = g_j + \eta\, (x_j^{(c)} - g_j y^{(c)})\, y^{(c)}, \qquad (3.4)$$

where η is a small positive learning rate parameter. The generative variances, $\tau_j^2$, which make up Ψ, are also updated, using an exponentially weighted moving average:

$$(\tau_j^2)' = \alpha\, \tau_j^2 + (1-\alpha)\, (x_j^{(c)} - g_j y^{(c)})^2, \qquad (3.5)$$
where α is a learning rate parameter that is slightly less than one. If the recognition model correctly inverted the generative model, these wake-phase updates would correctly implement the M-step of the EM algorithm (in the limit as η → 0 and α → 1). The update for $g_j$ in equation 3.4 improves the expected log likelihood because the increment to $g_j$ is proportional to the derivative of the log likelihood for the training case with y filled in, which is as follows:

$$\frac{\partial}{\partial g_j} \log p(x^{(c)}, y^{(c)} \mid g, \Psi) = \frac{\partial}{\partial g_j} \left( -\frac{1}{2\tau_j^2}\, (x_j^{(c)} - g_j y^{(c)})^2 \right) \qquad (3.6)$$

$$= \frac{1}{\tau_j^2}\, (x_j^{(c)} - g_j y^{(c)})\, y^{(c)}. \qquad (3.7)$$
The averaging operation by which the generative variances, $\tau_j^2$, are learned will (for α close to one) also lead toward the maximum likelihood values based on the filled-in values for y.

The parameters of the recognition model are learned in the sleep phase, based not on real data but on "fantasy" cases, produced using the current version of the generative model. Values for the hidden factor, $y^{(f)}$, and the visible variables, $x^{(f)}$, in a fantasy case are stochastically generated according to equation 2.1, as described at the beginning of this section. The connection weights for the recognition model are then updated using the delta rule, as follows:

$$r_j' = r_j + \eta\, (y^{(f)} - r^T x^{(f)})\, x_j^{(f)}, \qquad (3.8)$$

where η is again a small positive learning rate parameter, which might or might not be the same as that used in the wake phase. The recognition variance, $\sigma^2$, is updated as follows:

$$(\sigma^2)' = \alpha\, \sigma^2 + (1-\alpha)\, (y^{(f)} - r^T x^{(f)})^2, \qquad (3.9)$$

where α is again a learning rate parameter slightly less than one.
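For concreteness, the complete single-factor procedure (equations 3.4, 3.5, 3.8, and 3.9) amounts to the following minimal sketch (ours, not the authors' code; it assumes zero-mean data, so bias parameters are omitted):

```python
import numpy as np

def wake_sleep_fa(X, n_pres=2_000_000, eta=0.0002, alpha=0.999, rng=None):
    """Minimal sketch of wake-sleep factor analysis with one hidden factor,
    following equations 3.4, 3.5, 3.8, and 3.9 (zero-mean data assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    g, tau2 = np.zeros(p), np.ones(p)    # generative weights and variances
    r, sigma2 = np.zeros(p), 1.0         # recognition weights and variance

    for it in range(n_pres):
        # Wake phase: recognition model fills in y for a real case.
        x = X[it % n]
        y = r @ x + np.sqrt(sigma2) * rng.standard_normal()
        resid = x - g * y
        g = g + eta * resid * y                         # equation 3.4
        tau2 = alpha * tau2 + (1 - alpha) * resid**2    # equation 3.5

        # Sleep phase: generative model produces a fantasy case.
        y_f = rng.standard_normal()
        x_f = g * y_f + np.sqrt(tau2) * rng.standard_normal(p)
        err = y_f - r @ x_f
        r = r + eta * err * x_f                         # equation 3.8
        sigma2 = alpha * sigma2 + (1 - alpha) * err**2  # equation 3.9

    return g, tau2, r, sigma2
```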
These updates are analogous to those made in the wake phase and have the effect of improving the recognition model's ability to produce the correct distribution for the hidden factor. However, as noted in section 3.2, the criterion by which the sleep phase updates "improve" the recognition model does not correspond to a global Lyapunov function for wake-sleep learning as a whole, and we therefore lack a theoretical guarantee that the wake-sleep procedure will be stable and will converge to a maximum of the likelihood, though the experiments in section 4 indicate that it usually does.

3.4 Wake-Sleep Learning for Multiple-Factor Models. A Helmholtz machine with more than one hidden factor can be trained using the wake-sleep algorithm in much the same way as described above for a single-factor model. The generative model is simply extended by including more than one hidden factor. In the most straightforward case, the values of these factors are chosen independently at random when a fantasy case is generated. However, a new issue arises with the recognition model. When there are k hidden factors and p visible variables, the recognition model has the following form (assuming zero means):

$$y = R x + \nu, \qquad (3.10)$$
where the k × p matrix R contains the weights on the recognition connections, and ν is a k-dimensional gaussian random vector with mean 0 and covariance matrix Σ, which in general (i.e., for some arbitrary generative model) will not be diagonal. Generation of a random ν with such covariance can easily be done on a computer using the Cholesky decomposition of Σ, but in a method intended for consideration as a neurobiological model, we would prefer a local implementation that produces the same effect.

One way of producing arbitrary covariances between the hidden factors is to include a set of connections in the recognition model that link each hidden factor to hidden factors that come earlier (in some arbitrary ordering), as sketched in the code below. A Helmholtz machine with this architecture is shown in Figure 2. During the wake phase, values for the hidden factors are filled in sequentially by random generation from gaussian distributions. The mean of the distribution used in picking a value for factor $y_i$ is the weighted sum of inputs along connections from the visible variables and from the hidden factors whose values were chosen earlier. The variance of the distribution for factor $y_i$ is an additional recognition model parameter, $\sigma_i^2$. The k(k−1)/2 connections between the hidden factors and the k variances associated with these factors together make up the k(k+1)/2 independent degrees of freedom in Σ. Accordingly, a recognition model of this form exists that can perfectly invert any generative model.²

² Another way of seeing this is to note that the joint distribution for the visible variables and hidden factors is multivariate gaussian. From general properties of multivariate gaussians, the conditional distribution for a hidden factor given values for the visible variables and for the earlier hidden factors must also be gaussian, with some constant variance and with a mean given by some linear function of the other values.
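A minimal sketch of this sequential filling-in (ours; the names R, W, and sigma2 are hypothetical, with W holding the strictly lower-triangular correlation-inducing weights) might look as follows:

```python
import numpy as np

def sample_hidden_factors(x, R, W, sigma2, rng):
    """Fill in k hidden factors sequentially. R is the k x p matrix of
    recognition weights from the visible variables; W is a strictly
    lower-triangular k x k matrix of correlation-inducing weights from
    earlier factors; sigma2 holds the k recognition variances."""
    k = R.shape[0]
    y = np.zeros(k)
    base = R @ x
    for i in range(k):  # factors are ordered; y_i sees y_0 .. y_{i-1}
        mean = base[i] + W[i, :i] @ y[:i]
        y[i] = mean + np.sqrt(sigma2[i]) * rng.standard_normal()
    return y
```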
Figure 2: A Helmholtz machine implementing a model with three hidden factors, using a recognition model with a set of correlation-inducing connections. The figure omits some of the generative connections from hidden factors to visible variables and some of the recognition connections from visible variables to hidden factors (indicated by the ellipses). Note, however, that the only connections between hidden factors are those shown, which are part of the recognition model. There are no generative connections between hidden factors.
This method is reminiscent of some of the proposals to allow Hebbian learning to extract more than one principal component (e.g., Sanger, 1989; Plumbley, 1993); various of these also order the “hidden” units. In these proposals, the connections are used to remove correlations between the units so that they can represent different facets of the input. However, for the Helmholtz machine, the factors come to represent different facets of the input because they are jointly rather than separately engaged in capturing the statistical regularities in the input. The connections between the hidden units capture the correlations in the distribution for the hidden factors con-
ditional on given values for the visible variables that may be induced by this joint responsibility for modeling the visible variables.

Another approach is possible if the covariance matrix for the hidden factors, y, in the generative model is rotationally symmetric, as is the case when it is the identity matrix, as we have assumed so far. We may then freely rotate the space of factors ($y' = U y$, where U is a unitary matrix), making corresponding changes to the generative weights ($G' = G U^T$), without changing the distribution of the visible variables. When the generative model is rotated, the corresponding recognition model will also rotate. There always exist rotations in which the recognition covariance matrix, Σ, is diagonal and can therefore be represented by just k variances, $\sigma_i^2$, one for each factor. This amounts to forcing the factors to be independent, or factorial, which has itself long been suggested as a goal for early cortical processing (Barlow, 1989) and has generally been assumed in nonlinear versions of the Helmholtz machine. We can therefore hope to learn multiple-factor models using wake-sleep learning in exactly the same way as we learn single-factor models, with equations 2.4 and 3.10 specifying how the values of all the factors y combine to predict the input x, and vice versa. As seen in the next section, such a Helmholtz machine with only the capacity to represent k recognition variances, with no correlation-inducing connections, is usually able to find a rotation in which such a recognition model is sufficient to invert the generative model.

Note that there is no counterpart in Hebbian learning of this second approach, in which there are no connections between hidden factors. If Hebbian units are not connected in some manner, they will all extract the same single principal component of the inputs.

4 Empirical Results

We have run a number of experiments on synthetic data in order to test whether the wake-sleep algorithm applied to factor analysis finds parameter values that at least locally maximize the likelihood, in the limit of small values for the learning rates. These experiments also provide data on how small the learning rates must be in practice and reveal situations in which learning (particularly of the uniquenesses) can be relatively slow. We also report in section 4.4 the results of applying the wake-sleep algorithm to a real data set used by Everitt (1984). Away from conditions in which the maximum likelihood model is not well specified, the wake-sleep algorithm performs quite competently.

4.1 Experimental Procedure. All the systematic experiments described were done with randomly generated synthetic data. Models for various numbers (p) of visible variables, using various numbers (k) of hidden factors, were tested. For each such model, 10 sets of model parameters were generated, each of which was used to generate 2 sets of training data, the
first with 10 cases and the second with 500 cases. Models with the same p and k were then learned from these training sets using the wake-sleep algorithm.

The models used to generate the data were randomly constructed as follows. First, initial values for generative variances (the "uniquenesses") were drawn independently from exponential distributions with mean one, and initial values for the generative weights (the "factor loadings") were drawn independently from gaussian distributions with mean zero and variance one. These parameters were then rescaled so as to produce a variance of one for each visible variable; that is, for a single-factor model, the new generative variances were set to $\tau_j'^2 = f_j^2 \tau_j^2$ and the new generative weights to $g_j' = f_j g_j$, with the $f_j$ chosen such that the new variances for the visible variables, $\tau_j'^2 + g_j'^2$, were equal to one. A sketch of this construction appears below.
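The following code is ours, not the authors'; it generalizes the single-factor rescaling just described to k factors by normalizing each row of the loading matrix together with its uniqueness:

```python
import numpy as np

def random_fa_model(p, k, rng):
    """Random generative model: uniquenesses ~ Exp(mean 1), loadings ~
    N(0, 1), then rescaled so every visible variable has variance one."""
    tau2 = rng.exponential(1.0, p)
    G = rng.standard_normal((p, k))
    f = 1.0 / np.sqrt(tau2 + (G**2).sum(axis=1))  # per-variable scale
    return G * f[:, None], tau2 * f**2

G, tau2 = random_fa_model(p=6, k=1, rng=np.random.default_rng(1))
print(tau2 + (G**2).sum(axis=1))  # each entry is 1 (up to rounding)
```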
In all experiments, the wake-sleep learning procedure was started with the generative and recognition variances set to one and the generative and recognition weights and biases set to zero. Symmetry is broken by the stochastic nature of the learning procedure. Learning was done online, with cases from the training set being presented one at a time, in a fixed sequence. The learning rates for both generative and recognition models were usually set to η = 0.0002 and α = 0.999, but other values of η and α were investigated as well, as described below. In the models used to generate the data, the variables all had mean zero and variance one, but this knowledge was not built into the learning procedure. Bias parameters were therefore present, allowing the network to learn the means of the visible variables in the training data (which were close to zero, but not exactly zero, due to the use of a finite training set).

We compared the estimates produced by the wake-sleep algorithm with the maximum likelihood estimates based on the same training sets produced using the "factanal" function in S-Plus (Version 3.3, Release 1), which uses a modification of Jöreskog's (1977) method. We also implemented the EM factor analysis algorithm to examine in more detail cases in which wake-sleep disagreed with S-Plus. As with most stochastic learning algorithms, using fixed learning rates implies that convergence can at best be to a distribution of parameter values that is highly concentrated near some stable point. One would in general have to reduce the learning rates over time to ensure stronger forms of convergence. The software used in these experiments may be obtained over the Internet.³

³ Follow the links from the first author's home page: http://www.cs.utoronto.ca/~radford/

4.2 Experiments with Single-Factor Models. We first tried using the wake-sleep algorithm to learn single-factor models for three (p = 3) and six (p = 6) visible variables. Figure 3 shows the progress of learning for a
Figure 3: Wake-sleep learning of a single-factor model for six variables. The graphs show the progress of the generative variances and weights over the course of 2 million presentations of input vectors, drawn sequentially from a training set of size 500, with learning parameters of η = 0.0002 and α = 0.999 being used for both the wake and the sleep phases. The training set was randomly generated from a single-factor model whose parameters were picked at random, as described in section 4.1. The maximum likelihood parameter estimates found by S-Plus are shown as horizontal lines.
typical run with p = 6, applied to a training set of size 500. The figure shows progress over 2 million presentations of input vectors (4000 presentations of each of the 500 training cases). Both the generative variances and the generative weights are seen to converge to the maximum likelihood estimates found by S-Plus, with a small amount of random variation, as is expected with a stochastic learning procedure. Of the 20 runs with p = 6 (based on data generated from 10 random models, with training sets of size 10 and 500), all but one showed similar convergence to the S-Plus estimates within 3 million presentations (and usually much earlier). The remaining run (on a training set of size 10) converged to a different local maximum, which S-Plus found when its maximization routine was started from the values found by wake-sleep. Convergence to maximum likelihood estimates was sometimes slower when there were only three variables (p = 3). This is the minimum number of visible variables for which the single-factor model is identifiable (that is, for which the true values of the parameters can be found given enough data, apart from an ambiguity in the overall signs of the weights). Three of the 10 runs with training sets of size 10 and one of the 10 runs with training sets of size 500 failed to converge clearly within 3 million presentations,
but convergence was seen when these runs were extended to 10 million presentations (in one case to a different local maximum than initially found by S-Plus). The slowest convergence was for a training set of size 10 for which the maximum likelihood estimate for one of the generative weights was close to zero (0.038), making the parameters nearly unidentifiable.

All the runs were done with learning rates of η = 0.0002 and α = 0.999, for both the generative and recognition models. Tests on the training sets with p = 3 were also done with learning for the recognition model slowed to η = 0.00002 and α = 0.9999, a situation that one might speculate would cause problems, due to the recognition model's not being able to keep up with the generative model. No problems were seen (though the learning was, of course, slower).

4.3 Experiments with Multiple-Factor Models. We have also tried using the wake-sleep algorithm to learn models with two and three hidden factors (k = 2 and k = 3) from synthetic data generated as described in section 4.1. Systematic experiments were done for k = 2, p = 5, using models with and without correlation-inducing recognition weights, and for k = 2, p = 8 and k = 3, p = 9, in both cases using models with no correlation-inducing recognition weights. For most of the experiments, learning was done with η = 0.0002 and α = 0.999.

The behavior of wake-sleep learning with k = 2 and p = 5 was very similar for the model with correlation-inducing recognition connections and the model without such connections. The runs on most of the 20 data sets converged within the 6 million presentations that were initially performed; a few required more iterations before convergence was apparent. For two data sets (both of size 10), wake-sleep learning converged to different local maxima than S-Plus. For one training set of size 500, the maximum likelihood estimate for one of the generative variances is very close to one, making the model almost unidentifiable (p = 5 being the minimum number of visible variables for identifiability with k = 2); this produced an understandable difficulty with convergence, though the wake-sleep estimates still agreed fairly closely with the maximum likelihood values found by S-Plus.

In contrast with these good results, a worrying discrepancy arose with one of the training sets of size 10, for which the two smallest generative variances found using wake-sleep learning differed somewhat from the maximum likelihood estimates found by S-Plus (one of which was very close to zero). When S-Plus was started at the values found using wake-sleep, it did not find a similar local maximum; it simply found the same estimates as it had found with its default initial values. However, when the full EM algorithm was started at the estimates found by wake-sleep, it barely changed the parameters for many iterations. One explanation could be that there is a local maximum in this vicinity, but it is for some reason not found by S-Plus. Another possible explanation could be that the likelihood is extremely flat in this region. In neither case would the discrepancy be cause
for much worry regarding the general ability of the wake-sleep method to learn these multiple-factor models. However, it is also possible that this is an instance of the problem that arose more clearly in connection with the Everitt crime data, as reported in section 4.4.

Runs using models without correlation-inducing recognition connections were also performed for data sets with k = 2 and p = 8. All runs converged, most within 6 million presentations, in one case to a different local maximum than S-Plus. We also tried runs with the same model and data sets using higher learning rates (η = 0.002 and α = 0.99). The higher learning rates produced both higher variability and some bias in the parameter estimates, but for most data sets the results were still generally correct.

Finally, runs using models without correlation-inducing recognition connections were performed for data sets with k = 3 and p = 9. Most of these converged fine. However, for two data sets (both of size 500), a small but apparently real difference was seen between the estimates for the two smallest generative variances found using wake-sleep and the maximum likelihood estimates found using S-Plus. As was the case with the similar situation with k = 2, p = 5, S-Plus did not converge to a local maximum in this vicinity when started at the wake-sleep estimates. As before, however, the EM algorithm moved very slowly when started at the wake-sleep estimates, which is one possible explanation for wake-sleep having apparently converged to this point. However, it is also possible that something more fundamental prevented convergence to a local maximum of the likelihood, as discussed in connection with similar results in the next section.

4.4 Experiments with the Everitt Crime Data. We also tried learning a two-factor model for a data set used as an example by Everitt (1984), in which the visible variables are the rates for seven types of crime, with the cases being 16 American cities. The same learning procedure (with η = 0.0002 and α = 0.999) was used as in the experiments above, except that the visible variables were normalized to have mean zero and variance one, and bias parameters were accordingly omitted from both the generative and recognition models.

Fifteen runs with different random number seeds were done, for both models with a correlation-inducing recognition connection between the two factors and without such a connection. All of these runs produced results fairly close to the maximum likelihood estimates found by S-Plus (which match the results of Everitt). In some runs, however, there were small, but clearly real, discrepancies, most notably in the smallest two generative variances, for which the wake-sleep estimates were sometimes nearly zero, whereas the maximum likelihood estimate is zero for only one of them. This behavior is similar to that seen in the three runs where discrepancies were found in the systematic experiments of section 4.3. These discrepancies arose much more frequently when a correlation-inducing recognition connection was not present (14 out of 15 runs) than
when such a connection was included (6 out of 15 runs). Figure 4a shows one of the runs with discrepancies for a model with a correlation-inducing connection present. Figure 4b shows a run differing from that of 4a only in its random seed, but which did find the maximum likelihood estimates. The sole run in which the model without a correlation-inducing recognition connection converged to the maximum likelihood estimates is shown in Figure 4c. Close comparison of Figures 4b and 4c shows that even in this run, convergence was much more labored without the correlation-inducing connection. (In particular, the generative variance for variable 6 approaches zero rather more slowly.)

According to S-Plus, the solution found in the discrepant runs is not an alternative local maximum. Furthermore, extending the runs to many more iterations or reducing the learning rates by a large factor does not eliminate the problem. This makes it seem unlikely that the problem is merely slow convergence due to the likelihood being nearly flat in this vicinity, although, on the other hand, when EM is started at the discrepant wake-sleep estimates, its movement toward the maximum likelihood estimates is rather slow (after 200 iterations, the estimate for the generative variance for variable 4 had moved only about a third of the way from the wake-sleep estimate to the maximum likelihood estimate). It seems most likely therefore that the runs with discrepancies result from a local "basin of attraction" for wake-sleep learning that does not lead to a local maximum of the likelihood.

The discrepancies can be eliminated for the model with a correlation-inducing connection in either of two ways. One way is to reduce the learning rate for the generative parameters (in the wake phase), while leaving the recognition learning rate unchanged. As discussed in section 3.2, there is a theoretical reason to think that this method will lead to the maximum likelihood estimates, since the recognition model will then have time to learn how to invert the generative model perfectly. When the generative learning rates are reduced by setting η = 0.00005 and α = 0.99975, the maximum likelihood estimates are indeed found in eight of eight test runs. The second solution is to impose a constraint that prevents the generative variances from falling below 0.01. This also worked in eight of eight runs. However, these two methods produce little or no improvement when the correlation-inducing connection is omitted from the recognition model (no successes in eight runs with smaller generative learning rates; three successes in eight runs with a constraint on the generative variances). Thus, although models without correlation-inducing connections often work well, it appears that learning is sometimes easier and more reliable when they are present.

Finally, we tried learning a model for this data without first normalizing the visible variables to have mean zero and variance one, but with biases included in the generative and recognition models to handle the nonzero means. This is not a very sensible thing to do; the means of the variables in this data set are far from zero, so learning would at best take a long time, while the biases slowly adjusted. In fact, however, wake-sleep learning fails
Figure 4: Wake-sleep learning of two-factor models for the Everitt crime data. The horizontal axis shows the number of presentations in millions. The vertical axis shows the generative variances, with the maximum likelihood values indicated by horizontal lines. (a) One of the six runs in which a model with a correlation-inducing recognition connection did not converge to the maximum likelihood estimates. (b) One of the nine such runs that did find the maximum likelihood estimates. (c) The only run in which the maximum likelihood estimates were found using a model without a correlation-inducing connection.
rather spectacularly. The generative variances immediately become quite large. As soon as the recognition weights depart significantly from zero, the recognition variances also become quite large, at which point positive feedback ensues, and all the weights and variances diverge. The instability that can occur once the generative and recognition weights and variances are too large appears to be a fundamental aspect of wake-sleep dynamics. Interestingly, however, it seems that the dynamics may operate to avoid this unstable region of parameter space, since the instability does not occur with the Everitt data if the learning rate is set very low. With a larger learning rate, however, the stochastic aspect of the learning can produce a jump into the unstable region.

5 Discussion

We have shown empirically that wake-sleep learning, which involves nothing more than two simple applications of the local delta rule, can be used to implement the statistical technique of maximum likelihood factor analysis. However, just as it is usually computationally more efficient to implement principal component analysis using a standard matrix technique such as singular-value decomposition rather than by using Hebbian learning, factor analysis is probably better implemented on a computer using either EM (Rubin & Thayer, 1982) or the second-order Newton methods of Jöreskog (1967, 1969, 1977) than by the wake-sleep algorithm. In our view, wake-sleep factor analysis is interesting as a simple and successful example of wake-sleep learning and as a possible model of activity-dependent plasticity in the cortex.

5.1 Implications for Wake-Sleep Learning. The experiments in section 4 show that, in most situations, wake-sleep learning applied to factor analysis leads to estimates that (locally) maximize the likelihood, when the learning rate is set to be small enough. In a few situations, the parameter estimates found using wake-sleep deviated slightly from the maximum likelihood values. More work is needed to determine exactly when and why this occurs, but even in these situations, the estimates found by wake-sleep are reasonably good and the likelihoods are close to their maxima. The sensitivity of learning to settings of parameters such as learning rates appears no greater than is typical for stochastic gradient descent methods. In particular, setting the learning rate for the generative model to be equal to or greater than that for the recognition model did not lead to instability, even though this is the situation in which one might worry that the algorithm would no longer mimic EM (as discussed in section 3.2). However, we have been unable to prove in general that the wake-sleep algorithm for factor analysis is guaranteed to converge to the maximum likelihood solution. Indeed, the empirical results show that any general theoretical proof of correctness would have to contain caveats regarding unstable regions of the parameter
space. Such a proof may be impossible if the discrepancies seen in the experiments are indeed due to a false basin of attraction (rather than being an effect of finite learning rates). Despite the lack so far of strong theoretical results, the relatively simple factor analysis model may yet provide a good starting point for a better theoretical understanding of wake-sleep learning for more complex models. One empirical finding was that it is possible to learn multiple-factor models even when correlation-inducing recognition connections are absent, although the lack of such connections did cause difficulties in some cases. A better understanding of what is going on in this respect would provide insight into wake-sleep learning for more complex models, in which the recognition model will seldom be able to invert the generative model perfectly. Such a complex generative model may allow the data to be modeled in many nearly equivalent ways, for some of which the generative model is harder to invert than for others. Good performance may sometimes be possible only if wake-sleep learning favors the more easily inverted generative models. We have seen that this does usually happen for factor analysis.

5.2 Implications for Activity-Dependent Plasticity. Much progress has been made using Hebbian learning to model the development of structures in early cortical areas, such as topographic maps (Willshaw & von der Malsburg, 1976, 1979; von der Malsburg & Willshaw, 1977), ocular dominance stripes (Miller et al., 1989), and orientation domains (Linsker, 1988; Miller, 1994). However, Hebbian learning suffers from three problems that the equally local wake-sleep algorithm avoids. First, because of the positive feedback inherent in Hebbian learning, some form of synaptic normalization is required to make it work (Miller & MacKay, 1994). There is no evidence for synaptic normalization in the cortex. This is not an issue for Helmholtz machines because building a statistical model of the inputs involves negative feedback instead, as in the delta rule. Second, in order for Hebbian learning to produce several outputs that represent more than just the first principal component of a collection of inputs, there must be connections of some sort between the output units, which force them to be decorrelated. Typically, some form of anti-Hebbian learning is required for these connections (see Sanger, 1989; Földiák, 1989; Plumbley, 1993). The common alternative of using fixed lateral connections between the output units is not informationally efficient. In the Helmholtz machine, the goal of modeling the distribution of the inputs forces output units to differentiate rather than perform the same task. This goal also supplies a clear interpretation in terms of finding the hidden causes of the input. Third, Hebbian learning does not accord any role to the prominent cortical feature that top-down connections always accompany bottom-up connections. In the Helmholtz machine, these top-down connections play a
crucial role in wake-sleep learning. Furthermore, the top-down connections come to embody a hierarchical generative model of the inputs, some form of which appears necessary in any case to combine top-down and bottom-up processing during inference. Of course, the Helmholtz machine suffers from various problems itself. Although learning rules equivalent to the delta rule are conventional in classical conditioning (Rescorla & Wagner, 1972; Sutton & Barto, 1981) and have also been suggested as underlying cortical plasticity (Montague & Sejnowski, 1994), it is not clear how cortical microcircuitry might construct the required predictions (such as g_j y^(c) of equation 3.4) or prediction errors (such as x_j^(c) − g_j y^(c) of the same equation). The wake-sleep algorithm also requires two phases of activation, with different connections being primarily responsible for driving the cells in each phase. Although there is some suggestive evidence of this (Hasselmo & Bower, 1993), it has not yet been demonstrated. There are also alternative developmental methods that construct top-down statistical models of their inputs using only one, slightly more complicated, phase of activation (Olshausen & Field, 1996; Rao & Ballard, 1995). In Hebbian models of activity-dependent development, a key role is played by lateral local intracortical connections (longer-range excitatory connections are also present but develop later). Lateral connections are not present in the linear Helmholtz machines we have so far described, but they could play a role in inducing correlations between the hidden factors in the generative model, in the recognition model, or in both, perhaps replacing the correlation-inducing connections shown in Figure 2. Purely linear models, such as those presented here, will not suffice to explain fully the intricacies of information processing in the cortical hierarchy. However, understanding how a factor analysis model can be learned using simple and local operations is a first step to understanding how more complex statistical models can be learned using more complex architectures involving nonlinear elements and multiple layers.

Acknowledgments

We thank Geoffrey Hinton for many useful discussions. This research was supported by the Natural Sciences and Engineering Research Council of Canada and by grant R29 MH55541-01 from the National Institute of Mental Health of the United States. All opinions expressed are those of the authors.

References

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Dayan, P., & Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9, 1385–1403.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC (Vol. I, pp. 401–405).
Grenander, U. (1976–1981). Lectures in pattern theory I, II and III: Pattern analysis, pattern synthesis and regular structures. Berlin: Springer-Verlag.
Hasselmo, M. E., & Bower, J. M. (1993). Acetylcholine and memory. Trends in Neurosciences, 16, 218–222.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hinton, G. E., Dayan, P., Frey, B., & Neal, R. M. (1995). The wake-sleep algorithm for self-organizing neural networks. Science, 268, 1158–1160.
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32, 443–482.
Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34, 183–202.
Jöreskog, K. G. (1977). Factor analysis by least-squares and maximum-likelihood methods. In K. Enslein, A. Ralston, & H. S. Wilf (Eds.), Statistical methods for digital computers. New York: Wiley.
Linsker, R. (1986). From basic network principles to neural architecture. Proceedings of the National Academy of Sciences, 83, 7508–7512, 8390–8394, 8779–8783.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105–128.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. Journal of Neuroscience, 14, 409–441.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Montague, P. R., & Sejnowski, T. J. (1994). The predictive brain: Temporal coincidence and temporal order in synaptic learning mechanisms. Learning and Memory, 1, 1–33.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. Davis (Eds.), Large-scale theories of the cortex (pp. 125–152). Cambridge, MA: MIT Press.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Plumbley, M. D. (1993). Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 6, 823–833.
Rao, P. N. R., & Ballard, D. H. (1995). Dynamic model of visual memory predicts neural response properties in the visual cortex (Technical Rep. No. 95.4). Rochester, NY: Department of Computer Science, University of Rochester.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: The effectiveness of reinforcement and non-reinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–69). New York: Appleton-Century-Crofts.
Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47, 69–76.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Sutton, R. S., & Barto, A. G. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
von der Malsburg, C., & Willshaw, D. J. (1977). How to label nerve cells so that they can interconnect in an ordered fashion. Proceedings of the National Academy of Sciences, 74, 5176–5178.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organisation. Proceedings of the Royal Society of London B, 194, 431–445.
Willshaw, D. J., & von der Malsburg, C. (1979). A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Philosophical Transactions of the Royal Society B, 287, 203–243.

Received August 7, 1996; accepted April 3, 1997.
Communicated by Joachim Buhmann
Data Clustering Using a Model Granular Magnet

Marcelo Blatt, Shai Wiseman, and Eytan Domany
Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
We present a new approach to clustering, based on the physical properties of an inhomogeneous ferromagnet. No assumption is made regarding the underlying distribution of the data. We assign a Potts spin to each data point and introduce an interaction between neighboring points, whose strength is a decreasing function of the distance between the neighbors. This magnetic system exhibits three phases. At very low temperatures, it is completely ordered; all spins are aligned. At very high temperatures, the system does not exhibit any ordering, and in an intermediate regime, clusters of relatively strongly coupled spins become ordered, whereas different clusters remain uncorrelated. This intermediate phase is identified by a jump in the order parameters. The spin-spin correlation function is used to partition the spins and the corresponding data points into clusters. We demonstrate on three synthetic and three real data sets how the method works. Detailed comparison to the performance of other techniques clearly indicates the relative success of our method.

1 Introduction

In recent years there has been significant interest in adapting numerical (Kirkpatrick, Gelatt, & Vecchi, 1983) and analytic (Fu & Anderson, 1986; Mézard & Parisi, 1986) techniques from statistical physics to provide algorithms and estimates for good approximate solutions to hard optimization problems (Yuille & Kosowsky, 1994). In this article we formulate the problem of data clustering as that of measuring equilibrium properties of an inhomogeneous Potts model. We are able to give good clustering solutions by solving the physics of this model. Cluster analysis is an important technique in exploratory data analysis, where a priori knowledge of the distribution of the observed data is not available (Duda & Hart, 1973; Jain & Dubes, 1988). Partitional clustering methods, which divide the data according to natural classes present in it, have been used in a large variety of scientific disciplines and engineering applications, among them pattern recognition (Duda & Hart, 1973), learning theory (Moody & Darken, 1989), astrophysics (Dekel & West, 1985), medical
imaging (Suzuki, Shibata, & Suto, 1995) and data processing (Phillips et al., 1995), machine translation of text (Cranias, Papageorgiou, & Piperdis, 1994), image compression (Karayiannis, 1994), satellite data analysis (Baraldi & Parmiggiani, 1995), automatic target recognition (Iokibe, 1994), and speech recognition (Kosaka & Sagayama, 1994) and analysis (Foote & Silverman, 1994). The goal is to find a partition of a given data set into several compact groups. Each group indicates the presence of a distinct category in the measurements. The problem of partitional clustering can be formally stated as follows. Determine the partition of N given patterns {v_i}, i = 1, . . . , N, into groups, called clusters, such that the patterns of a cluster are more similar to each other than to patterns in different clusters. It is assumed that either d_ij, the measure of dissimilarity between patterns v_i and v_j, is provided or that each pattern v_i is represented by a point x_i in a D-dimensional metric space, in which case d_ij = ‖x_i − x_j‖. The two main approaches to partitional clustering are called parametric and nonparametric. In parametric approaches some knowledge of the clusters' structure is assumed, and in most cases patterns can be represented by points in a D-dimensional metric space. For instance, each cluster can be parameterized by a center around which the points that belong to it are spread with a locally gaussian distribution. In many cases the assumptions are incorporated in a global criterion whose minimization yields the "optimal" partition of the data. The goal is to assign the data points so that the criterion is minimized. Classical approaches are variance minimization, maximal likelihood, and fitting gaussian mixtures. A nice example of variance minimization is the method proposed by Rose, Gurewitz, and Fox (1990) based on principles of statistical physics, which ensures an optimal solution under certain conditions. This work gave rise to other mean field methods for clustering data (Buhmann & Kühnel, 1993; Wong, 1993; Miller & Rose, 1996). Classical examples of fitting gaussian mixtures are the Isodata algorithm (Ball & Hall, 1967) or its sequential relative, the K-means algorithm (MacQueen, 1967) in statistics, and soft competition in neural networks (Nowlan & Hinton, 1991). In many cases of interest, however, there is no a priori knowledge about the data structure. Then it is more natural to adopt nonparametric approaches, which make fewer assumptions about the model and therefore are suitable to handle a wider variety of clustering problems. Usually these methods employ a local criterion, against which some attribute of the local structure of the data is tested, to construct the clusters. Typical examples are hierarchical techniques such as the agglomerative and divisive methods (see Jain & Dubes, 1988). These algorithms suffer, however, from at least one of the following limitations: high sensitivity to initialization, poor performance when the data contain overlapping clusters, or an inability to handle variabilities in cluster shapes, cluster densities, and cluster sizes. The most serious problem is the lack of cluster validity criteria; in particular, none of
these methods provides an index that could be used to determine the most significant partitions among those obtained in the entire hierarchy. All of these algorithms tend to create clusters even when no natural clusters exist in the data. We recently introduced a new approach to clustering, based on the physical properties of a magnetic system (Blatt, Wiseman, & Domany, 1996a, 1996b, 1996c). This method has a number of rather unique advantages: it provides information about the different self-organizing regimes of the data; the number of "macroscopic" clusters is an output of the algorithm; and hierarchical organization of the data is reflected in the manner the clusters merge or split when a control parameter (the physical temperature) is varied. Moreover, the results are completely insensitive to the initial conditions, and the algorithm is robust against the presence of noise. The algorithm is computationally efficient; equilibration time of the spin system scales with N, the number of data points, and is independent of the embedding dimension D. In this article we extend our work by demonstrating the efficiency and performance of the algorithm on various real-life problems. Detailed comparisons with other nonparametric techniques are also presented. The outline of the article is as follows. The magnetic model and thermodynamic definitions are introduced in section 2. A very efficient Monte Carlo method used for calculating the thermodynamic quantities is presented in section 3. The clustering algorithm is described in section 4. In section 5 we analyze synthetic and real data to demonstrate the main features of the method and compare its performance with other techniques.

2 The Potts Model

Ferromagnetic Potts models have been studied extensively for many years (see Wu, 1982, for a review). The basic spin variable s can take one of q integer values: s = 1, 2, . . . , q. In a magnetic model the Potts spins are located at points v_i that reside on (or off) the sites of some lattice. Pairs of spins associated with points i and j are coupled by an interaction of strength J_ij > 0. Denote by S a configuration of the system, S = {s_i}, i = 1, . . . , N. The energy of such a configuration is given by the Hamiltonian

H(S) = \sum_{\langle i,j \rangle} J_{ij} \left( 1 - \delta_{s_i, s_j} \right), \qquad s_i = 1, \ldots, q,    (2.1)

where the notation ⟨i, j⟩ stands for neighboring sites v_i and v_j. The contribution of a pair ⟨i, j⟩ to H is 0 when s_i = s_j, that is, when the two spins are aligned, and is J_ij > 0 otherwise. If one chooses interactions that are a decreasing function of the distance d_ij ≡ d(v_i, v_j), then the closer two points are to each other, the more they "like" to be in the same state. The Hamiltonian (see equation 2.1) is very similar to other energy functions used in neural
systems, where each spin variable represents a q-state neuron with an excitatory coupling to its neighbors. In fact, magnetic models have inspired many neural models (see, for example, Hertz, Krogh, & Palmer, 1991). In order to calculate the thermodynamic average of a physical quantity A at a fixed temperature T, one has to calculate the sum

\langle A \rangle = \sum_{S} A(S) \, P(S),    (2.2)

where the Boltzmann factor,

P(S) = \frac{1}{Z} \exp\left( -\frac{H(S)}{T} \right),    (2.3)
plays the role of the probability density, which gives the statistical weight of each spin configuration S = {s_i}, i = 1, . . . , N, in thermal equilibrium, and Z is a normalization constant, Z = \sum_S \exp(-H(S)/T). Some of the most important physical quantities A for this magnetic system are the order parameter or magnetization and the set of δ_{s_i,s_j} functions, because their thermal averages reflect the ordering properties of the model. The order parameter of the system is ⟨m⟩, where the magnetization, m(S), associated with a spin configuration S is defined (Chen, Ferrenberg, & Landau, 1992) as

m(S) = \frac{q \, N_{\max}(S) - N}{(q - 1) \, N}    (2.4)

with

N_{\max}(S) = \max \{ N_1(S), N_2(S), \ldots, N_q(S) \},

where N_µ(S) is the number of spins with the value µ; N_µ(S) = \sum_i \delta_{s_i, µ}. The thermal average of δ_{s_i,s_j} is called the spin-spin correlation function,

G_{ij} = \langle \delta_{s_i, s_j} \rangle,    (2.5)

which is the probability of the two spins s_i and s_j being aligned. When the spins are on a lattice and all nearest-neighbor couplings are equal, J_ij = J, the Potts system is homogeneous. Such a model exhibits two phases. At high temperatures the system is paramagnetic or disordered, ⟨m⟩ = 0, indicating that N_max(S) ≈ N/q for all statistically significant configurations. In this phase the correlation function G_ij decays to 1/q when the distance between points v_i and v_j is large; this is the probability of finding two completely independent Potts spins in the same state. At very high temperatures even neighboring sites have G_ij ≈ 1/q.
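To make these definitions concrete, the following minimal sketch (our own illustration, not part of the original article; all names and the toy data are ours) evaluates the Hamiltonian of equation 2.1 and the magnetization of equation 2.4 for a single spin configuration:

import numpy as np

# Illustrative sketch: evaluate equations 2.1 and 2.4 for one configuration.
def hamiltonian(s, pairs, J):
    # H(S) = sum over neighbor pairs <i,j> of J_ij * (1 - delta(s_i, s_j))
    i, j = pairs[:, 0], pairs[:, 1]
    return float(np.sum(J * (s[i] != s[j])))

def magnetization(s, q):
    # m(S) = (q * N_max(S) - N) / ((q - 1) * N), equation 2.4
    counts = np.bincount(s, minlength=q)
    n = s.size
    return (q * counts.max() - n) / ((q - 1) * n)

# Toy usage: 5 spins, q = 20 states, two coupled neighbor pairs.
q = 20
s = np.array([0, 0, 3, 3, 7])
pairs = np.array([[0, 1], [2, 3]])
J = np.array([1.0, 0.5])
print(hamiltonian(s, pairs, J), magnetization(s, q))

Averaging m(S) and δ_{s_i,s_j} over configurations drawn from the Boltzmann distribution (equation 2.3) then yields ⟨m⟩ and G_ij.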
As the temperature is lowered, the system undergoes a sharp transition to an ordered, ferromagnetic phase; the magnetization jumps to ⟨m⟩ ≠ 0. This means that in the physically relevant configurations (at low temperatures), one Potts state "dominates" and N_max(S) exceeds N/q by a macroscopic number of sites. At very low temperatures ⟨m⟩ ≈ 1 and G_ij ≈ 1 for all pairs {v_i, v_j}. The variance of the magnetization is related to a relevant thermal quantity, the susceptibility,

\chi = \frac{N}{T} \left( \langle m^2 \rangle - \langle m \rangle^2 \right),    (2.6)
which also reflects the thermodynamic phases of the system. At low temperatures, fluctuations of the magnetization are negligible, so the susceptibility χ is small in the ferromagnetic phase. The connection between Potts spins and clusters of aligned spins was established by Fortuin and Kasteleyn (1972). In the article appendix we present such a relation and the probability distribution of such clusters. We turn now to strongly inhomogeneous Potts models. This is the situation when the spins form magnetic "grains," with very strong couplings between neighbors that belong to the same grain and very weak interactions between all other pairs. At low temperatures, such a system is also ferromagnetic, but as the temperature is raised, the system may exhibit an intermediate, superparamagnetic phase. In this phase strongly coupled grains are aligned (that is, are in their respective ferromagnetic phases), while there is no relative ordering of different grains. At the transition temperature from the ferromagnetic to superparamagnetic phase a pronounced peak of χ is observed (Blatt et al., 1996a). In the superparamagnetic phase fluctuations of the state taken by grains acting as a whole (that is, as giant superspins) produce large fluctuations in the magnetization. As the temperature is raised further, the superparamagnetic-to-paramagnetic transition is reached; each grain disorders, and χ abruptly diminishes by a factor that is roughly the size of the largest cluster. Thus, the temperatures where a peak of the susceptibility occurs and the temperatures at which χ decreases abruptly indicate the range of temperatures in which the system is in its superparamagnetic phase. In principle one can have a sequence of several transitions in the superparamagnetic phase. As the temperature is raised, the system may break first into two clusters, each of which breaks into more (macroscopic) subclusters, and so on. Such a hierarchical structure of the magnetic clusters reflects a hierarchical organization of the data into categories and subcategories. To gain some analytic insight into the behavior of inhomogeneous Potts ferromagnets, we calculated the properties of such a "granular" system with a macroscopic number of bonds for each spin. For such "infinite-range" models, mean field is exact, and we have shown (Wiseman, Blatt, & Domany,
1996; Blatt et al., 1996b) that in the paramagnetic phase, the spin state at each site is independent of any other spin, that is, G_ij = 1/q. At the paramagnetic-superparamagnetic transition the correlation between spins belonging to the same group jumps abruptly to

\frac{q-1}{q} \left( \frac{q-2}{q-1} \right)^2 + \frac{1}{q} \simeq 1 - \frac{2}{q} + O\left( \frac{1}{q^2} \right),

while the correlation between spins belonging to different groups is unchanged. The ferromagnetic phase is characterized by strong correlations between all spins of the system:

G_{ij} > \frac{q-1}{q} \left( \frac{q-2}{q-1} \right)^2 + \frac{1}{q}.

There is an important lesson to remember from this: in mean field we see that in the superparamagnetic phase, two spins that belong to the same grain are strongly correlated, whereas for pairs that do not belong to the same grain, G_ij is small. As it turns out, this double-peaked distribution of the correlations is not an artifact of mean field and will be used in our solution of the problem of data clustering. As we will show below, we use the data points of our clustering problem as sites of an inhomogeneous Potts ferromagnet. Presence of clusters in the data gives rise to magnetic grains of the kind described above in the corresponding Potts model. Working in the superparamagnetic phase of the model, we use the values of the pair correlation function of the Potts spins to decide whether a pair of spins does or does not belong to the same grain, and we identify these grains as the clusters of our data. This is the essence of our method.

3 Monte Carlo Simulation of Potts Models: The Swendsen-Wang Method

The aim of equilibrium statistical mechanics is to evaluate sums such as equation 2.2 for models with N ≫ 1 spins (actually one is usually interested in the thermodynamic limit, for example, when the number of spins N → ∞). This can be done analytically only for very limited cases. One resorts therefore to various approximations (such as mean field) or to computer simulations that aim at evaluating thermal averages numerically. Direct evaluation of sums like equation 2.2 is impractical, since the number of configurations S increases exponentially with the system size N. Monte Carlo simulation methods (see Binder & Heermann, 1988, for an introduction) overcome this problem by generating a characteristic subset of configurations, which are used as a statistical sample. They are based
on the notion of importance sampling, in which a set of spin configurations {S_1, S_2, . . . , S_M} is generated according to the Boltzmann probability distribution (see equation 2.3). Then, expression 2.2 is reduced to a simple arithmetic average,

\langle A \rangle \approx \frac{1}{M} \sum_{i=1}^{M} A(S_i),    (3.1)

where the number of configurations in the sample, M, is much smaller than q^N, the total number of configurations. The set of M states necessary for the implementation of equation 3.1 is constructed by means of a Markov process in the configuration space of the system. There are many ways to generate such a Markov chain; in this work it turned out to be essential to use the Swendsen-Wang (Wang & Swendsen, 1990; Swendsen, Wang, & Ferrenberg, 1992) Monte Carlo algorithm (SW). The main reason for this choice is that it is perfectly suitable for working in the superparamagnetic phase: it overturns an aligned cluster in one Monte Carlo step, whereas algorithms that use standard local moves will take forever to do this. The first configuration can be chosen at random (or by setting all s_i = 1). Say we already generated n configurations of the system, {S_i}, i = 1, . . . , n, and we start to generate configuration n + 1. This is the way it is done. First, visit all pairs of spins ⟨i, j⟩ that interact, that is, have J_ij > 0; the two spins are frozen together with probability

p^f_{i,j} = \left( 1 - \exp\left( -\frac{J_{ij}}{T} \right) \right) \delta_{s_i, s_j}.    (3.2)
That is, if in our current configuration S_n the two spins are in the same state, s_i = s_j, then sites i and j are frozen with probability p^f = 1 − exp(−J_ij/T). Having gone over all the interacting pairs, the next step of the algorithm is to identify the SW clusters of spins. An SW cluster contains all spins that have a path of frozen bonds connecting them. Note that according to equation 3.2, only spins of the same value can be frozen in the same SW cluster. After this step, our N sites are assigned to some number of distinct SW clusters. If we think of the N sites as vertices of a graph whose edges are the interactions between neighbors J_ij > 0, each SW cluster is a subgraph of vertices connected by frozen bonds. The final step of the procedure is to generate the new spin configuration S_{n+1}. This is done by drawing at random, independently for each SW cluster, a value s = 1, . . . , q, which is assigned to all its spins. This defines one Monte Carlo step S_n → S_{n+1}. By iterating this procedure M times while calculating the physical quantity A(S_i) at each Monte Carlo step, the thermodynamic average (see equation 3.1) is obtained. The physical quantities that we are interested in are the magnetization (see equation 2.4) and its square
value for the calculation of the susceptibility χ, and the spin-spin correlation function (see equation 2.5). Actually, in most simulations a number of the early configurations are discarded, to allow the system to "forget" its initial state. This is not necessary if the number of configurations M is not too small (increasing M improves the statistical accuracy of the Monte Carlo measurement). Measuring autocorrelation times (Gould & Tobochnik, 1989) provides a way of both deciding on the number of discarded configurations and checking that the number of configurations M generated is sufficiently large. A less rigorous way is simply plotting the energy as a function of the number of SW steps and verifying that the energy reached a stable regime. At temperatures where large regions of correlated spins occur, local methods (such as Metropolis), which flip one spin at a time, become very slow. The SW procedure overcomes this difficulty by flipping large clusters of aligned spins simultaneously. Hence the SW method exhibits much smaller autocorrelation times than local methods. The efficiency of the SW method, which is widely used in numerous applications, has been tested in various Potts (Billoire et al., 1991) and Ising (Hennecke & Heyken, 1993) models.

4 Clustering of Data: Detailed Description of the Algorithm

So far we have defined the Potts model, the various thermodynamic functions that one measures for it, and the (numerical) method used to measure these quantities. We can now turn to the problem for which these concepts will be utilized: clustering of data. For the sake of concreteness, assume that our data consist of N patterns or measurements v_i, specified by N corresponding vectors x_i, embedded in a D-dimensional metric space. Our method consists of three stages. The starting point is the specification of the Hamiltonian (see equation 2.1), which governs the system. Next, by measuring the susceptibility χ and magnetization as a function of temperature, the different phases of the model are identified. Finally, the correlation of neighboring pairs of spins, G_ij, is measured. This correlation function is then used to partition the spins and the corresponding data points into clusters. The outline of the three stages and the subtasks contained in each can be summarized as follows:

1. Construct the physical analog Potts spin problem:
   (a) Associate a Potts spin variable s_i = 1, 2, . . . , q to each point v_i.
   (b) Identify the neighbors of each point v_i according to a selected criterion.
   (c) Calculate the interaction J_ij between neighboring points v_i and v_j.

2. Locate the superparamagnetic phase:
   (a) Estimate the (thermal) average magnetization, ⟨m⟩, for different temperatures.
   (b) Use the susceptibility χ to identify the superparamagnetic phase.

3. In the superparamagnetic regime:
   (a) Measure the spin-spin correlation, G_ij, for all neighboring points v_i, v_j.
   (b) Construct the data clusters.

In the following subsections we provide detailed descriptions of the manner in which each of the three stages is to be implemented.

4.1 The Physical Analog Potts Spin Problem. The goal is to specify the Hamiltonian of the form in equation 2.1 that serves as the physical analog of the data points to be clustered. One has to assign a Potts spin to each data point and introduce short-range interactions between spins that reside on neighboring points. Therefore we have to choose the value of q, the number of possible states a Potts spin can take, define what is meant by neighbor points, and provide the functional dependence of the interaction strength J_ij on the distance between neighboring spins. We discuss now the possible choices for these attributes of the Hamiltonian and their influence on the algorithm's performance. The most important observation is that none of them needs fine tuning; the algorithm performs well provided a reasonable choice is made, and the range of reasonable choices is very wide.

4.1.1 The Potts Spin Variables. The number of Potts states, q, determines mainly the sharpness of the transitions and the temperatures at which they occur. The higher the q, the sharper the transition. (For a two-dimensional regular lattice, one must have q > 4 to ensure that the transition is of first order, in which case the order parameter exhibits a discontinuity; Baxter, 1973; Wu, 1982.) On the other hand, in order to maintain a given statistical accuracy, it is necessary to perform longer simulations as the value of q increases. From our simulations we conclude that the influence of q on the resulting classification is weak. We used q = 20 in all the examples presented in this work. Note that the value of q does not imply any assumption about the number of clusters present in the data.

4.1.2 Identifying Neighbors. The need for identification of the neighbors of a point x_i could be eliminated by letting all pairs i, j of Potts spins interact with each other via a short-range interaction J_ij = f(d_ij), which decays sufficiently fast (say, exponentially or faster) with the distance between the
two data points. The phases and clustering properties of the model will not be affected strongly by the choice of f. Such a model has O(N²) interactions, which makes its simulation rather expensive for large N. For the sake of computational convenience, we decided to keep only the interactions of a spin with a limited number of neighbors, and to set all other J_ij to zero. Since the data do not form a regular lattice, one has to supply some reasonable definition for "neighbors." As it turns out, our results are quite insensitive to the particular definition used. Ahuja (1982) argues for intuitively appealing characteristics of Delaunay triangulation over other graph structures in data clustering. We use this definition when the patterns are embedded in a low-dimensional (D ≤ 3) space. For higher dimensions, we use the mutual neighborhood value; we say that v_i and v_j have a mutual neighborhood value K, if and only if v_i is one of the K-nearest neighbors of v_j and v_j is one of the K-nearest neighbors of v_i. We chose K such that the interactions connect all data points to one connected graph. Clearly K grows with the dimensionality. We found it convenient, in cases of very high dimensionality (D > 100), to fix K = 10 and to superimpose on the edges obtained with this criterion the edges corresponding to the minimal spanning tree associated with the data. We use this variant only in the examples presented in sections 5.2 and 5.3.

4.1.3 Local Interaction. In order to have a model with the physical properties of a strongly inhomogeneous granular magnet, we want strong interaction between spins that correspond to data from a high-density region and weak interactions between neighbors that are in low-density regions. To this end and in common with other local methods, we assume that there is a local length scale ∼ a, which is defined by the high-density regions and is smaller than the typical distance between points in the low-density regions. This a is the characteristic scale over which our short-range interactions decay. We tested various choices but report here only results that were obtained using

J_{ij} = \begin{cases} \dfrac{1}{\hat{K}} \exp\left( -\dfrac{d_{ij}^2}{2a^2} \right) & \text{if } v_i \text{ and } v_j \text{ are neighbors} \\[1ex] 0 & \text{otherwise.} \end{cases}    (4.1)
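As a concrete illustration of equation 4.1, here is a hedged sketch (ours, not from the paper), with a and K̂ computed as described in the next paragraph. For brevity the neighbor graph is plain K-nearest-neighbor, a simplification of the mutual-neighborhood and Delaunay criteria of section 4.1.2:

import numpy as np

# Sketch only: build the couplings of equation 4.1 from a data matrix X (N x D).
def build_couplings(X, K=10):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # all pairwise distances
    pairs = set()
    for i in range(n):
        for j in np.argsort(d[i])[1:K + 1]:                    # skip self at position 0
            pairs.add((min(i, j), max(i, j)))
    pairs = np.array(sorted(pairs))
    dij = d[pairs[:, 0], pairs[:, 1]]
    a = dij.mean()                 # local length scale: mean neighbor distance
    K_hat = 2 * len(pairs) / n     # average number of neighbors
    J = np.exp(-dij**2 / (2 * a**2)) / K_hat
    return pairs, J, a, K_hat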
We chose the local length scale, a, to be the average of all distances d_ij between neighboring pairs v_i and v_j. K̂ is the average number of neighbors; it is twice the number of nonvanishing interactions divided by the number of points N. This careful normalization of the interaction strength enables us to estimate the temperature corresponding to the highest superparamagnetic transition (see section 4.2). Everything done so far can be easily implemented in the case when instead of providing the x_i for all the data we have an N × N matrix of dissim-
ilarities d_ij. This was tested in experiments for clustering of images where only a measure of the dissimilarity between them was available (Gdalyahu & Weinshall, 1997). Application of other clustering methods would have necessitated embedding these data in a metric space; the need for this was eliminated by using superparamagnetic clustering. The results obtained by applying the method on the matrix of dissimilarities of these images were excellent; all points were classified with no error. (Interestingly, the triangle inequality was violated in about 5 percent of the cases.)

4.2 Locating the Superparamagnetic Regions. The various temperature intervals in which the system self-organizes into different partitions into clusters are identified by measuring the susceptibility χ as a function of temperature. We start by summarizing the Monte Carlo procedure and conclude by providing an estimate of the highest transition temperature to the superparamagnetic regime. Starting from this estimate, one can take increasingly refined temperature scans and calculate the function χ(T) by Monte Carlo simulation. We used the SW method described in section 3, with the following procedure (a code sketch follows the list):

1. Choose the number of iterations M to be performed.
2. Generate the initial configuration by assigning a random value to each spin.
3. Assign a frozen bond between nearest neighbor points v_i and v_j with probability p^f_{i,j} (see equation 3.2).
4. Find the connected subgraphs, the SW clusters.
5. Assign new random values to the spins (spins that belong to the same SW cluster are assigned the same value). This is the new configuration of the system.
6. Calculate the value assumed by the physical quantities of interest in the new spin configuration.
7. Go to step 3 unless the maximal number of iterations, M, was reached.
8. Calculate the averages (see equation 3.1).

The superparamagnetic phase can contain many different subphases with different ordering properties. A typical example can be generated by data with a hierarchical structure, giving rise to different acceptable partitions of the data. We measure the susceptibility χ at different temperatures in order to locate these different regimes. The aim is to identify the temperatures at which the system changes its structure.
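The procedure above can be sketched as follows. This is our own illustrative code, not the authors' implementation; the union-find bookkeeping and all names are ours, and pairs and J are the neighbor list and couplings of equation 4.1:

import numpy as np

# One Swendsen-Wang sweep (steps 3-5) plus a temperature scan for chi(T).
def sw_sweep(s, pairs, J, T, q, rng):
    n = s.size
    parent = np.arange(n)
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    # Step 3: freeze aligned neighbors with probability 1 - exp(-J_ij / T).
    freeze = (s[pairs[:, 0]] == s[pairs[:, 1]]) & \
             (rng.random(len(pairs)) < 1.0 - np.exp(-J / T))
    for i, j in pairs[freeze]:              # step 4: SW clusters via union-find
        parent[find(i)] = find(j)
    roots = np.array([find(a) for a in range(n)])
    return rng.integers(0, q, size=n)[roots]  # step 5: one new value per cluster

def susceptibility_scan(pairs, J, n, temps, q=20, sweeps=500, seed=0):
    rng = np.random.default_rng(seed)
    chi = []
    for T in temps:
        s = rng.integers(0, q, size=n)      # step 2: random initial configuration
        m = []
        for _ in range(sweeps):             # steps 3-7
            s = sw_sweep(s, pairs, J, T, q, rng)
            counts = np.bincount(s, minlength=q)
            m.append((q * counts.max() - n) / ((q - 1) * n))
        m = np.asarray(m)                   # step 8: averages, then equation 2.6
        chi.append(n / T * (np.mean(m**2) - np.mean(m)**2))
    return np.array(chi)

In practice one would also discard early sweeps for equilibration, as discussed in section 3; the sketch omits this for brevity.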
Interestingly, the triangle inequality was violated in about 5 percent of the cases.
The superparamagnetic phase is characterized by a nonvanishing susceptibility. Moreover, there are two basic features of χ in which we are interested. The first is a peak in the susceptibility, which signals a ferromagnetic-to-superparamagnetic transition, at which a large cluster breaks into a few smaller (but still macroscopic) clusters. The second feature is an abrupt decrease of the susceptibility, corresponding to a superparamagnetic-to-paramagnetic transition, in which one or more large clusters have melted (i.e., they broke up into many small clusters). The location of the superparamagnetic-to-paramagnetic transition, which occurs at the highest temperature, can be roughly estimated by the following considerations. First, we approximate the clusters by an ordered lattice of coordination number K̂ and a constant interaction

J \approx \langle\langle J_{ij} \rangle\rangle = \left\langle\!\left\langle \frac{1}{\hat{K}} \exp\left( -\frac{d_{ij}^2}{2a^2} \right) \right\rangle\!\right\rangle \approx \frac{1}{\hat{K}} \exp\left( -\frac{\langle\langle d_{ij}^2 \rangle\rangle}{2a^2} \right),

where ⟨⟨· · ·⟩⟩ denotes the average over all neighbors. Second, from the Potts model on a square lattice (Wu, 1982), we get that this transition should occur at roughly

T \approx \frac{1}{4 \log(1 + \sqrt{q})} \exp\left( -\frac{\langle\langle d_{ij}^2 \rangle\rangle}{2a^2} \right).    (4.2)
An estimate based on the mean field model yields a very similar value.

4.3 Identifying the Data Clusters. Once the superparamagnetic phase and its different subphases have been identified, we select one temperature in each region of interest. The rationale is that each subphase characterizes a particular type of partition of the data, with new clusters merging or breaking. On the other hand, as the temperature is varied within a phase, one expects only shrinking or expansion of the existing clusters, changing only the classification of the points on the boundaries of the clusters.

4.3.1 The Spin-Spin Correlation. We use the spin-spin correlation function G_ij, between neighboring sites v_i and v_j, to build the data clusters. In principle we have to calculate the thermal average (see equation 3.1) of δ_{s_i,s_j} in order to obtain G_ij. However, the SW method provides an improved estimator (Niedermayer, 1990) of the spin-spin correlation function. One calculates the two-point connectedness C_ij, the probability that sites v_i and v_j belong to the same SW cluster, which is estimated by the average (see equation 3.1) of the following indicator function:

c_{ij} = \begin{cases} 1 & \text{if } v_i \text{ and } v_j \text{ belong to the same SW cluster} \\ 0 & \text{otherwise.} \end{cases}
C_ij = ⟨c_ij⟩ is the probability of finding sites v_i and v_j in the same SW cluster. Then the relation (Fortuin & Kasteleyn, 1972)

G_{ij} = \frac{(q - 1) \, C_{ij} + 1}{q}    (4.3)
is used to obtain the correlation function G_ij.

4.3.2 The Data Clusters.
Clusters are identified in three steps:
1. Build the clusters' core using a thresholding procedure; if G_ij > 0.5, a link is set between the neighbor data points v_i and v_j. The resulting connected graph depends weakly on the value (0.5) used in this thresholding, as long as it is bigger than 1/q and less than 1 − 2/q. The reason is that the distribution of the correlations between two neighboring spins peaks strongly at these two values and is very small between them (see Figure 3b).

2. Capture points lying on the periphery of the clusters by linking each point v_i to its neighbor v_j of maximal correlation G_ij. It may happen, of course, that points v_i and v_j were already linked in the previous step.

3. Data clusters are identified as the linked components of the graphs obtained in steps 1 and 2.

Although it would be completely equivalent to use in steps 1 and 2 the two-point connectedness, C_ij, instead of the spin-spin correlation, G_ij, we considered the latter to stress the relation of our method with the physical analogy we are using. (A code sketch of these three steps is given after Figure 3.)

5 Applications

The approach presented in this article has been successfully tested on a variety of data sets. The six examples we discuss were chosen with the intention of demonstrating the main features and utility of our algorithm, to which we refer as the superparamagnetic clustering (SPC) method. We use both artificial and real data. Comparisons with the performance of other classical (nonparametric) methods are also presented. We refer to different clustering methods by the nomenclature used by Jain and Dubes (1988) and Fukunaga (1990). The nonparametric algorithms we have chosen belong to four families: (1) hierarchical methods: single link and complete link; (2) graph theory–based methods: Zhan's minimal spanning tree and Fukunaga's directed graph method; (3) nearest-neighbor clustering type, based on different proximity measures: the mutual neighborhood clustering algorithm and k-shared neighbors; and (4) density estimation: Fukunaga's valley-seeking method. These algorithms are of the same kind as the superparamagnetic method
in the sense that only weak assumptions are required about the underlying data structure. The results from all these methods depend on various parameters in an uncontrolled way; we always used the best result that was obtained. A unifying view of some of these methods in the framework of the work discussed here is presented in the article appendix.

5.1 A Pedagogical 2-Dimensional Example. The main purpose of this simple example is to illustrate the features of the method discussed, in particular the behavior of the susceptibility and its use for the identification of the two kinds of phase transitions. The influence of the number of Potts states, q, and the partition of the data as a function of the temperature are also discussed. The toy problem of Figure 1 consists of 4800 points in D = 2 dimensions whose angular distribution is uniform and whose radial distribution is normal with variance 0.25; θ ∼ U[0, 2π], r ∼ N[R, 0.25]; we generated half the points with R = 3, one-third with R = 2, and one-sixth with R = 1. Since there is a small overlap between the clusters, we consider the Bayes solution as the optimal result; that is, points whose distance to the origin is bigger than 2.5 are considered a cluster, points whose radial coordinate lies between 1.5 and 2.5 are assigned to a second cluster, and the remaining points define the third cluster. These optimal clusters consist of 2393, 1602, and 805 points, respectively. By applying our procedure, and choosing the neighbors according to the mutual neighborhood criterion with K = 10, we obtain the susceptibility as a function of the temperature as presented in Figure 2a. The temperature estimated from equation 4.2 for the superparamagnetic-to-paramagnetic transition is 0.075, which is in good agreement with the one inferred from Figure 2a. Figure 1 presents the clusters obtained at T = 0.05. The sizes of the three largest clusters are 2358, 1573, and 779, including 98 percent of the data; the classification of all these points coincides with that of the optimal Bayes classifier. The remaining 90 points are distributed among 43 clusters of size smaller than 4. As can be noted in Figure 1, the small clusters (fewer than 4 points) are located at the boundaries between the main clusters. One of the most salient features of the SPC method is that the spin-spin correlation function, G_ij, reflects the existence of two categories of neighboring points: neighboring points that belong to the same cluster and those that do not. This can be observed from Figure 3b, the two-peaked frequency distribution of the correlation function G_ij between neighboring points of
Figure 1: Data distribution: The angular coordinate is uniformly distributed, that is, U[0, 2π], while the radial one is normal N[R, 0.25] distributed around three different radii R. The outer cluster (R = 3.0) consists of 2400 points, the central one (R = 2.0) of 1600, and the inner one (R = 1.0) of 800. The classified data set: Points classified at T = 0.05 as belonging to the three largest clusters are marked by circles (outer cluster, 2358 points), squares (central cluster, 1573 points), and triangles (inner cluster, 779 points). The x's denote the 90 remaining points, which are distributed in 43 clusters, the biggest of size 4.
Figure 1. In contrast, the frequency distribution (see Figure 3a) of the normalized distances d_ij/a between neighboring points of Figure 1 contains no hint of the existence of a natural cutoff distance that separates neighboring points into two categories. It is instructive to observe the behavior of the size of the clusters as a function of the temperature, presented in Figure 2b. At low temperatures, as expected, all data points form only one cluster. At the ferromagnetic-to-superparamagnetic transition temperature, indicated by a peak in the susceptibility, this cluster splits into three. These essentially remain stable in
Figure 2: (a) The susceptibility density χ T/N of the data set of Figure 1 as a function of the temperature. (b) Size of the four biggest clusters obtained at each temperature.
their composition until the superparamagnetic-to-paramagnetic transition temperature is reached, expressed as a sudden decrease of the susceptibility χ, where the clusters melt. Turning now to the effect of the parameters on the procedure, we found (Wiseman et al., 1996) that the number of Potts states q affects the sharpness of the transition, but the obtained classification is almost the same. For instance, choosing q = 5 we found that the three largest clusters contained 2349, 1569, and 774 data points, while taking q = 200 yielded 2354, 1578, and 782. Of all the algorithms listed at the beginning of this section, only the single-link and minimal spanning tree methods were able to give (at the optimal values of their clustering parameter) a partition that reflects the underlying distribution of the data. The best results are summarized in Table 1, together
Figure 3: Frequency distribution of (a) distances between neighboring points of Figure 1 (scaled by the average distance a), and (b) spin-spin correlation of neighboring points.
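The double-peaked distribution in Figure 3b is what makes the 0.5 threshold of section 4.3.2 robust. The following minimal sketch (ours; all names are illustrative) implements the three-step cluster construction from the estimated pair correlations:

import numpy as np

# Sketch (ours) of section 4.3.2: build clusters from pair correlations G_ij.
def build_clusters(pairs, G, n, theta=0.5):
    links = [tuple(p) for p, g in zip(pairs, G) if g > theta]   # step 1: cores
    best = {}                                                   # step 2: periphery
    for (i, j), g in zip(pairs, G):
        for a, b in ((i, j), (j, i)):
            if g > best.get(a, (-1.0, None))[0]:
                best[a] = (g, b)                                # max-correlation neighbor
    links += [(a, b) for a, (_, b) in best.items()]
    parent = list(range(n))                                     # step 3: components
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in links:
        parent[find(a)] = find(b)
    return [find(a) for a in range(n)]                          # cluster label per point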
Table 1: Clusters Obtained with the Methods That Succeeded in Recovering the Structure of the Data.

Method                        Outer Cluster   Central Cluster   Inner Cluster   Unclassified Points
Bayes                         2393            1602              805             —
Superparamagnetic (q = 200)   2354            1578              782             86
Superparamagnetic (q = 20)    2358            1573              779             90
Superparamagnetic (q = 5)     2349            1569              774             108
Single link                   2255            1513              758             274
Minimal spanning tree         2262            1487              756             295
Notes: Points belonging to clusters of fewer than 50 points are counted as unclassified. The Bayes method is used as the benchmark because it is the one that minimizes the expected number of mistakes, provided that the distribution that generated the set of points is known.
with those of the SPC method. Clearly, the standard parametric methods (such as K-means or Ward’s method) would not be able to give a reasonable answer because they assume that different clusters are parameterized by different centers and a spread around them. In Figure 4 we present, for the methods that depend on only a single parameter, the sizes of the four biggest clusters that were obtained as a function of the clustering parameter. The best solution obtained with the single-link method (for a narrow range of the parameter) corresponds also
Figure 4: Size of the three biggest clusters as a function of the clustering parameter obtained with (a) single-link, (b) directed graph, (c) complete-link, (d) mutual neighborhood, (e) valley-seeking, and (f) shared-neighborhood algorithm. The arrow in (a) indicates the region corresponding to the optimal partition for the single-link method. The other algorithms were unable to recover the data structure.
to three big clusters of 2255, 1513, and 758 points, respectively, while the remaining clusters are of size smaller than 14. For a larger threshold distance, the second and third clusters are linked. This classification is slightly worse than the one obtained by the superparamagnetic method. When comparing SPC with single link, one should note that if the "correct" answer is not known, one has to rely on measurements such as the stability of the largest clusters (existence of a plateau) to indicate the quality of the partition. As can be observed from Figure 4a, there is no clear signal indicating which plateau corresponds to the optimal partition among the whole hierarchy yielded by single link. The best result obtained with the minimal spanning tree method is very similar to the one obtained with
the single link, but this solution corresponds to a very small fraction of its parameter space. In comparison, SPC allows clear identification of the relevant superparamagnetic phase; the entire temperature range of this regime yields excellent clustering results.

5.2 Only One Cluster. Most existing algorithms impose a partition on the data even when there are no natural classes present. The aim of this example is to show how the SPC algorithm signals this situation. Two different 100-dimensional data sets of 1000 samples are used. The first data set is taken from a gaussian distribution centered at the origin, with covariance matrix equal to the identity. The second data set consists of points generated randomly from a uniform distribution in a hypercube of side 2. The susceptibility curves obtained by applying the SPC method to these data sets are shown in Figures 5a and 5b. The narrow peak and the absence of a plateau indicate that there is only a single phase transition (ferromagnetic to paramagnetic), with no superparamagnetic phase. This single transition is also evident from Figures 5c and 5d, where only one cluster of almost 1000 points appears below the transition. This single "macroscopic" cluster "melts" at the transition into many "microscopic" clusters of 1 to 3 points each. Clearly, all existing methods are able to give the correct answer here, since it is always possible to set the parameters such that this trivial solution is obtained. Again, however, there is no clear indicator for the correct value of the control parameters of the different methods.

5.3 Performance: Scaling with Data Dimension and Influence of Irrelevant Features. The aim of this example is to show the robustness of the SPC method and to give an idea of the influence of the dimension of the data on its performance. To this end, we generated N D-dimensional points whose density distribution is a mixture of two isotropic gaussians, that is,
$$P(\vec{x}) = \left(\sqrt{2\pi\sigma^2}\right)^{-D} \left[ \exp\left(-\frac{\|\vec{x}-\vec{y}_1\|^2}{2\sigma^2}\right) + \exp\left(-\frac{\|\vec{x}-\vec{y}_2\|^2}{2\sigma^2}\right) \right], \tag{5.1}$$
where $\vec{y}_1$ and $\vec{y}_2$ are the centers of the gaussians and $\sigma$ determines their width. Since the two characteristic lengths involved are $\|\vec{y}_1 - \vec{y}_2\|$ and $\sigma$, the relevant parameter of this example is the normalized distance,

$$L = \frac{\|\vec{y}_1 - \vec{y}_2\|}{\sigma}.$$
The manner in which these data points were generated satisfies precisely the hypothesis about the data distribution that is assumed by the K-means algorithm. Therefore, it is clear that this algorithm (with K = 2) will achieve the Bayes optimal result; the same holds for other parametric methods, such as maximum likelihood (once a two-gaussian distribution for the data is assumed).
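For concreteness, data sets of this kind can be drawn directly from equation 5.1. The following sketch (ours, not the authors'; all names are assumptions) generates N points in D dimensions at a prescribed normalized distance L, placing the centers symmetrically about the origin:

```python
import numpy as np

def two_gaussian_mixture(n_points, dim, L, sigma=1.0, seed=0):
    """Draw n_points from the mixture of equation 5.1: two isotropic
    gaussians of width sigma whose centers are a distance L * sigma apart."""
    rng = np.random.default_rng(seed)
    # Centers on the first coordinate axis, so that ||y1 - y2|| = L * sigma.
    y1 = np.zeros(dim); y1[0] = +0.5 * L * sigma
    y2 = np.zeros(dim); y2[0] = -0.5 * L * sigma
    # Each point comes from either gaussian with probability 1/2.
    which = rng.integers(0, 2, size=n_points)
    centers = np.where(which[:, None] == 0, y1, y2)
    return centers + sigma * rng.standard_normal((n_points, dim)), which

# The experiment quoted below: 4000 points in 200 dimensions with L = 13.
X, true_labels = two_gaussian_mixture(n_points=4000, dim=200, L=13.0)
```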
Figure 5: Susceptibility density χT/N as a function of the temperature T for data points (a) uniformly distributed in a hypercube of side 2 and (b) multinormally distributed with a covariance matrix equal to the identity in a 100-dimensional space. The sizes of the two biggest clusters obtained at each temperature are presented in (c) and (d), respectively.
Although such algorithms have an obvious advantage over SPC for these kinds of data, it is interesting to get a feeling for the loss in the quality of the results caused by using our method, which relies on fewer assumptions. To this end we considered the case of 4000 points generated in a 200-dimensional space from the distribution in equation 5.1, setting the parameter L = 13.0. The two biggest clusters we obtained were of sizes 1853 and 1816; the smaller ones contained fewer than 4 points each. About 8.0 percent of the points were left unclassified, but all those points that the method did assign to one of the two large clusters were classified in agreement with a Bayes classifier. For comparison we applied the single-linkage algorithm to the same data; at the best classification point, 74 percent of the points were unclassified. Next we studied the minimal distance, $L_c$, at which the method is able to recognize that two clusters are present in the data, and the dependence of $L_c$ on the dimension D and the number of samples N. Note that the
lower bound for the minimal discriminant distance for any nonparametric algorithm is 2 (for any dimension D). Below this distance, the distribution is no longer bimodal; rather, the maximal density of points is located at the midpoint between the gaussian centers. Sets of N = 1000, 2000, 4000, and 8000 samples and space dimensions D = 2, 10, 100, and 1000 were tested. We set the number of neighbors to K = 10 and superimposed the minimal spanning tree to ensure that at T = 0, all points belong to the same cluster. To our surprise, we observed that in the range 1000 ≤ N ≤ 8000, the critical distance depends only weakly on the number of samples, N. The second remarkable result is that the critical discriminant distance $L_c$ grows very slowly with the dimensionality of the data points, D. Apparently the minimal discriminant distance $L_c$ increases like the logarithm of the number of dimensions D,

$$L_c \approx \alpha + \beta \log D, \tag{5.2}$$
where $\alpha$ and $\beta$ do not depend on D. The best fit in the range 2 ≤ D ≤ 1000 yields $\alpha = 2.3 \pm 0.3$ and $\beta = 1.3 \pm 0.2$. Thus, this example suggests that the dimensionality of the points does not affect the performance of the method significantly. A more careful interpretation is that the method is robust against irrelevant features present in the characterization of the data. Clearly there is only one relevant feature in this problem, which is given by the projection

$$x' = \frac{\vec{y}_1 - \vec{y}_2}{\|\vec{y}_1 - \vec{y}_2\|} \cdot \vec{x}.$$
The Bayes classifier, which has the lowest expected error, is implemented by assigning $\vec{x}_i$ to cluster 1 if $x'_i < 0$ and to cluster 2 otherwise. Therefore we can consider the other D − 1 dimensions as irrelevant features, because they do not carry any relevant information. Thus, equation 5.2 tells us how noise, expressed as the number of irrelevant features present, affects the performance of the method. Adding pure noise variables to the true signal can lead to considerable confusion when classical methods are used (Fowlkes, Gnanadesikan, & Kettering, 1988).
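A minimal sketch of this Bayes rule (ours; it assumes the centers are known, and subtracts their midpoint so that the sign test of the text, which presumes centers symmetric about the origin, applies in general):

```python
import numpy as np

def bayes_classify(X, y1, y2):
    """Bayes rule for the equal-weight mixture of equation 5.1: project
    each point on the direction y1 - y2 (measured from the midpoint
    between the centers) and assign it by the sign of the projection."""
    direction = (y1 - y2) / np.linalg.norm(y1 - y2)
    midpoint = 0.5 * (y1 + y2)
    x_prime = (X - midpoint) @ direction
    # x' < 0 -> cluster 1, x' >= 0 -> cluster 2, as in the text.
    return np.where(x_prime < 0, 1, 2)
```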
5.4 The Iris Data. The first "real" example we present is the time-honored Anderson-Fisher Iris data, a popular benchmark problem for clustering procedures. It consists of measurements of four quantities performed on each of 150 flowers. The specimens were chosen from three species of Iris, so the data constitute 150 points in four-dimensional space. The purpose of this experiment is to present a slightly more complicated scenario than that of Figure 1. From the projection on the plane spanned by the first two principal components, presented in Figure 6, we observe that there is a well-separated cluster (corresponding to the Iris setosa species)
Figure 6: Projection of the iris data on the plane spanned by its two principal components.
while the clusters corresponding to Iris virginica and Iris versicolor overlap. We determined neighbors in the D = 4 dimensional space according to the mutual K nearest neighbors definition (with K = 5), applied the SPC method, and obtained the susceptibility curve of Figure 7a; it clearly shows two peaks. When heated, the system first breaks into two clusters at T ≈ 0.1. At $T_{clus}$ = 0.2 we obtain two clusters, of sizes 80 and 40; the points of the smaller cluster correspond to the species Iris setosa. At T ≈ 0.6 another transition occurs; the larger cluster splits in two. At $T_{clus}$ = 0.7 we identified clusters of sizes 45, 40, and 38, corresponding to the species Iris versicolor, virginica, and setosa, respectively. As opposed to the toy problems, the Iris data break into clusters in two stages. This reflects the fact that two of the three species are "closer" to each other than to the third one; the SPC method clearly handles such hierarchical organization of the data very well. Among the samples, 125 were classified correctly (as compared with manual classification), and 25 were left unclassified. No further breaking of clusters was observed; all three disorder at $T_{ps}$ ≈ 0.8 (since all three are of about the same density). The best results of all the clustering algorithms used in this work, together with those of the SPC method, are summarized in Table 2.
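The mutual K-nearest-neighbor definition used above declares $v_i$ and $v_j$ neighbors only if each is among the K nearest points of the other. A brute-force sketch (ours; fine for 150 points, though the O(N²) distance matrix would not scale to large data sets):

```python
import numpy as np

def mutual_knn_edges(X, K=5):
    """Edges of the mutual K-nearest-neighbor graph: vi and vj are
    neighbors iff vi is among the K nearest points of vj and vice versa."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :K]   # indices of the K nearest points
    is_knn = np.zeros(d.shape, dtype=bool)
    rows = np.repeat(np.arange(len(X)), K)
    is_knn[rows, knn.ravel()] = True
    mutual = is_knn & is_knn.T           # keep only the symmetric pairs
    return [(i, j) for i, j in zip(*np.where(mutual)) if i < j]
```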
Figure 7: (a) The susceptibility density χT/N as a function of the temperature and (b) the size of the four biggest clusters obtained at each temperature for the Iris data.
Among these, the minimal spanning tree procedure obtained the most accurate result, followed by our method, while the remaining clustering techniques failed to provide a satisfactory result.

5.5 LANDSAT Data. Clustering techniques have been very popular in remote sensing applications (Faber, Hochberg, Kelly, Thomas, & White, 1994; Kelly & White, 1993; Kamata, Kawaguchi, & Niimi, 1995; Larch, 1994; Kamata, Eason, & Kawaguchi, 1991). Multispectral scanners on LANDSAT satellites sense the electromagnetic energy of the light reflected by the earth's surface in several bands (or wavelengths) of the spectrum. A pixel represents the smallest area on the earth's surface that can be separated from the neighboring areas. The pixel size and the number of bands vary, depending on the scanner; in this case, four bands are utilized, with a pixel resolution of 80 × 80 meters. Two of the wavelengths are in the visible region, corresponding approximately to green (0.52–0.60 µm) and red (0.63–0.69 µm), and the other two are in the near-infrared (0.76–0.90 µm) and mid-infrared (1.55–1.75 µm) regions. The wavelength interval associated with each band is tuned to a particular cover category. For example, the green band is useful
Table 2: Best Partition Obtained with the Clustering Methods.

Method                      Biggest Cluster   Middle Cluster   Smallest Cluster
Minimal spanning tree              50                50                50
Superparamagnetic                  45                40                38
Valley seeking                     67                42                37
Complete link                      81                39                30
Directed graph                     90                30                30
K-shared neighbors                 90                30                30
Single link                       101                30                19
Mutual neighborhood value         101                30                19

Note: Only the minimal spanning tree and the superparamagnetic method returned clusters where points belonging to different Iris species were not mixed.
for identifying areas of shallow water, such as shoals and reefs, whereas the red band emphasizes urban areas. The data consist of 6437 samples that are contained in a rectangle of 82 × 100 pixels. Each "data point" is described by thirty-six features that correspond to a 3 × 3 square of pixels. A classification label (ground truth) of the central pixel is also provided. The data are given in random order, and certain samples have been removed, so that one cannot reconstruct the original image. The data were provided by Srinivasan (1994) and are available at the University of California at Irvine (UCI) Machine Learning Repository (Murphy & Aha, 1994). The goal is to find the "natural classes" present in the data (without using the labels, of course). The quality of our results is determined by the extent to which the clustering reflects the six terrain classes present in the data: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil. This exercise is close to a real problem of remote sensing, where the true labels (ground truth) on the pixels are not available, and therefore clustering techniques are needed to group pixels on the basis of the sensed observations. We used the projection pursuit method (Friedman, 1987), a dimension-reducing transformation, in order to gain some knowledge about the organization of the data. Among the first six two-dimensional projections that were produced, we present in Figure 8 the one that best reflects the (known) structure of the data. We observe that the clusters differ in their density, there is unequal coupling between clusters, and the density of the points within a cluster is not uniform but rather decreases toward the perimeter of the cluster. The susceptibility curve in Figure 9a reveals four transitions that reflect
Figure 8: Best two-dimensional projection pursuit among the first six solutions for the LANDSAT data.
the presence of the following hierarchy of clusters (see Figure 10). At the lowest temperature, two clusters, A and B, appear. Cluster A splits at the second transition into $A_1$ and $A_2$. At the next transition, cluster $A_1$ splits into $A_1^1$ and $A_1^2$. At the last transition, cluster $A_2$ splits into four clusters $A_2^i$, i = 1, ..., 4. At this temperature the clusters $A_1$ and B are no longer identifiable; their spins are in a disordered state, since the density of points in $A_1$ and B is significantly smaller than within the $A_2^i$ clusters. Thus, the superparamagnetic method overcomes the difficulty of dealing with clusters of different densities by analyzing the data at several temperatures. This hierarchy indeed reflects the structure of the data. The clusters obtained in the temperature range 0.08 to 0.12 coincide with the picture obtained by projection pursuit: cluster B corresponds to the cotton crop terrain class, $A_1$ to red soil, and the remaining four terrain classes are grouped in cluster $A_2$. The clusters $A_1^1$ and $A_1^2$ are a partition of the red soil, while $A_2^1$, $A_2^2$, $A_2^3$, and $A_2^4$ correspond, respectively, to the classes grey soil, very damp grey
Figure 9: (a) Susceptibility density χT/N of the LANDSAT data as a function of the temperature T. The numbers in parentheses indicate the phase transitions. (b) The sizes of the four biggest clusters at each temperature. The jumps indicate that a cluster has been split. The symbols A, B, $A_i$, and $A_i^j$ correspond to the hierarchy depicted in Figure 10.
soil, damp grey soil, and soil with vegetation stubble.4 Ninety-seven percent purity was obtained, meaning that points belonging to different categories were almost never assigned to the same cluster. Only the optimal answer of Fukunaga's valley-seeking method and our SPC method succeeded in recovering the structure of the LANDSAT data. Fukunaga's method, however, yielded grossly different answers for different (random) initial conditions; our answer was stable.
4 This partition of the red soil is not reflected in the “true” labels. It would be of interest to reevaluate the labeling and try to identify the features that differentiate the two categories of red soil that were discovered by our method.
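The purity figure quoted above admits several conventions; the sketch below (ours) uses one common definition, the fraction of clustered points whose ground-truth label is the majority label of their cluster, which may differ in detail from the authors' exact measure:

```python
import numpy as np

def purity(cluster_ids, true_labels):
    """Fraction of clustered points whose true label is the majority label
    of their cluster. Convention assumed here: cluster_ids < 0 marks
    unclassified points, and true_labels are nonnegative integers."""
    clustered = cluster_ids >= 0
    correct = 0
    for c in np.unique(cluster_ids[clustered]):
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()  # size of the majority label
    return correct / clustered.sum()
```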
Figure 10: The LANDSAT data reveal a hierarchical structure. The numbers in parentheses correspond to the phase transitions indicated by a peak in the susceptibility (see Figure 9).
5.6 Isolated-Letter Speech Recognition. In the isolated-letter speech-recognition task, the "name" of a single letter is pronounced by a speaker. The resulting audio signal is recorded for all letters of the English alphabet for many speakers. The task is to find the structure of the data, which is expected to be a hierarchy reflecting the similarity that exists between different groups of letters, such as {B, D} or {M, N}, which differ only in a single articulatory feature. This analysis could be useful, for instance, to determine to what extent the chosen features succeed in differentiating the spoken letters. We used the ISOLET database of 7797 examples created by Ron Cole (Fanty & Cole, 1991), which is available at the UCI Machine Learning Repository (Murphy & Aha, 1994). The data were recorded from 150 speakers balanced for sex and representing many different accents and English dialects. Each speaker pronounced each of the twenty-six letters twice (there are three examples missing). Cole's group has developed a set of 617 features describing each example. All attributes are continuous and scaled into the range −1 to 1. The features include spectral coefficients, contour features,
and sonorant, presonorant, and postsonorant features. The order of appearance of the features is not known. We applied the SPC method and obtained the susceptibility curve shown in Figure 11a and the cluster size versus temperature curve presented in Figure 11b. The resulting partitioning obtained at different temperatures can be cast in hierarchical form, as presented in Figure 12a. We also tried the projection pursuit method, but none of the first six two-dimensional projections succeeded in revealing any relevant characteristic of the structure of the data. In assessing the extent to which the SPC method succeeded in recovering the structure of the data, we built a "true" hierarchy by using the known labels of the examples. To do this, we first calculate the center of each class (letter) by averaging over all the examples belonging to it. Then a 26 × 26 matrix of the distances between these centers is constructed. Finally, we apply the single-link method to construct a hierarchy, using this proximity matrix. The result is presented in Figure 12b. The purity of the clustering was again very high (93 percent), and 35 percent of the samples were left as unclassified points. The cophenetic correlation coefficient validation index (Jain & Dubes, 1988) is equal to 0.98 for this graph, which indicates that this hierarchy fits the data very well. Since our method does not have a natural length scale defined at each resolution, we cannot use this index for our tree. Nevertheless, the good quality of our tree, presented in Figure 12a, is indicated by the good agreement between it and the tree of Figure 12b. Note, however, that in order to construct the reference tree depicted in Figure 12b, the correct label of each point must be known.
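The construction of the reference tree just described can be sketched as follows (a hypothetical re-implementation: SciPy's hierarchical-clustering routines stand in for whatever code the authors used, and X, labels denote the 7797 × 617 feature matrix and the letter labels):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

def reference_tree(X, labels):
    """The 'true' hierarchy of Figure 12b: average the examples of each
    letter into a center, then single-link the 26 x 26 center distances."""
    classes = np.unique(labels)
    centers = np.vstack([X[labels == c].mean(axis=0) for c in classes])
    condensed = pdist(centers)               # condensed form of the 26 x 26 proximity matrix
    Z = linkage(condensed, method='single')  # single-link dendrogram
    coph_corr, _ = cophenet(Z, condensed)    # the validation index cited above
    return Z, classes, coph_corr
```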
6 Complexity and Computational Overhead

Nonparametric clustering is performed in two main stages:

Stage 1: Determination of the geometrical structure of the problem. Basically, a number of nearest neighbors of each point has to be found, using any reasonable algorithm, such as identifying the points lying inside a sphere of a given radius or taking a given number of closest neighbors (as in the SPC algorithm).

Stage 2: Manipulation of the data. Each method is characterized by a specific processing of the data.

For almost all methods, including SPC, the complexity is determined by the first stage, because it requires more computational effort than the data manipulation itself. Finding the nearest neighbors is an expensive task; the complexity of branch-and-bound algorithms (Kamgar-Parsi & Kanal, 1985) is of order $O(N^{\nu}\log N)$ with $1 < \nu < 2$. Since this operation is common to all nonparametric clustering methods, any extra computational overhead our algorithm may have over some other nonparametric method must be due to the difference between the costs of the manipulations performed beyond this stage.
Figure 11: (a) Susceptibility density as a function of the temperature for the isolated-letter speech-recognition data. (b) Size of the four biggest clusters returned by the algorithm for each temperature.
Figure 12: Isolated-letter speech-recognition hierarchy obtained by (a) the superparamagnetic method and (b) using the labels of the data and assuming each letter is well represented by a center.
The second stage of the SPC method consists of equilibrating a system at each temperature. In general, the complexity is of order N (Binder & Heermann, 1988; Gould & Tobochnik, 1988).

Scaling with N. The main reason for choosing an unfrustrated ferromagnetic system, rather than a spin glass (where negative interactions are allowed), is that ferromagnets reach thermal equilibrium very fast.
Very efficient Monte Carlo algorithms (Wang & Swendsen, 1990; Wolff, 1989; Kandel & Domany, 1991) were developed for these systems, in which the number of sweeps needed for thermal equilibration is small at the temperatures of interest. The number of operations required for each SW Monte Carlo sweep scales linearly with the number of edges; it is of order K × N (Hoshen & Kopelman, 1976). In all the examples of this article we used a fixed number of sweeps (M = 1000). Therefore, the fact that the SPC method relies on a stochastic algorithm does not prevent it from being efficient.

Scaling with D. The equilibration stage does not depend on the dimension of the data, D. In fact, it is not necessary to know the dimensionality of the data at all, as long as the distances between neighboring points are known. Since the complexity of the equilibration stage is of order N and does not scale with D, the complexity of the method is determined by the search for the nearest neighbors. Therefore, we conclude that the complexity of our method does not exceed that of the most efficient deterministic nonparametric algorithms.

For the sake of concreteness, we present the running times of the second stage on an HP-9000 (series K200) machine for two problems: the LANDSAT and ISOLET data. The corresponding running times were 1.97 and 2.53 minutes per temperature, respectively (0.12 and 0.15 sec per sweep per temperature). Note that there is good agreement with the discussion presented above: the ratio of the CPU times is close to the ratio of the corresponding total numbers of edges (18,388 in the LANDSAT and 22,471 in the ISOLET data set), and there is no dependence on the dimensionality. Typical runs involve about twenty temperatures, which leads to 40 and 50 minutes of CPU, respectively. This number of temperatures can be significantly reduced by using the Monte Carlo histogram method (Swendsen, 1993), where a set of simulations at a small number of temperatures suffices to calculate thermodynamic averages for the complete temperature range of interest. Of all the deterministic methods we used, the most efficient is the minimal spanning tree. Once the tree is built, it requires only 19 and 23 seconds of CPU, respectively, for each set of clustering parameters. However, the actual running time is determined by how long one spends searching for the optimal parameters in the (three-dimensional) parameter space of the method. The other nonparametric methods presented in this article were not optimized, and therefore comparison of their running times could be misleading. For instance, we used Johnson's algorithm for implementing the single and complete linkage, which requires $O(N^3)$ operations for recovering the whole hierarchy, but faster versions, based on minimal spanning trees, require fewer operations. Running Friedman's projection pursuit algorithm,5 whose results are presented in Figure 8, required 55 CPU minutes for LANDSAT.
5 We thank Jerome Friedman for allowing public use of his program.
For the ISOLET data (where D = 617), the difference was dramatic: projection pursuit required more than a week of CPU time, while SPC required about 1 hour. The reason is that our algorithm does not scale with the dimension of the data D, whereas the complexity of projection pursuit increases very fast with D.

7 Discussion

This article proposes a new approach to nonparametric clustering, based on a physical, magnetic analogy. The mapping onto the magnetic problem is very simple. A Potts spin is assigned to each data point, and short-range ferromagnetic interactions between spins are introduced, whose strength decreases with distance. The thermodynamic system defined in this way presents different self-organizing regimes, and the parameter that determines the behavior of the system is the temperature. As the temperature is varied, the system undergoes many phase transitions. The idea is that each phase reflects a particular data structure related to a particular length scale of the problem. Basically, the clustering obtained at one temperature belonging to a specific phase should not differ substantially from the partition obtained at another temperature in the same phase. On the other hand, the clusterings obtained at two temperatures corresponding to different phases must be significantly different, reflecting different organizations of the data. These ordering properties are reflected in the susceptibility χ and the spin-spin correlation function $G_{ij}$. The susceptibility turns out to be very useful for signaling the transitions between the different phases of the system. The correlation function $G_{ij}$ is used as a similarity index, whose value is determined both by the distance between sites $v_i$ and $v_j$ and by the density of points near and between these sites. The separation of the spin-spin correlations $G_{ij}$ into strong and weak, as evident in Figure 3b, reflects the existence of two categories of collective behavior. In contrast, as shown in Figure 3a, the frequency distribution of distances $d_{ij}$ between neighboring points of Figure 1 does not even hint that a natural cut-off distance, separating neighboring points into two categories, exists. Since the double-peaked shape of the correlations' distribution persists at all relevant temperatures, the separation into strong and weak correlations is a robust property of the proposed Potts model. The procedure is stochastic, since we use a Monte Carlo procedure to measure the different properties of the system, but it is completely insensitive to initial conditions. Moreover, the cluster distribution as a function of the temperature is known. Basically, there is a competition between the positive interaction, which encourages the spins to be aligned (the energy, which appears in the exponential of the Boltzmann weight, is minimal when all points belong to a single cluster), and the thermal disorder, which assigns a "bonus" that grows exponentially with the number of uncorrelated spins and, hence, with the number of clusters. This method is robust in the presence of noise and is able to recover
the hierarchical structure of the data without enforcing the presence of clusters. The superparamagnetic method is also successful in real-life problems, where existing methods fail to overcome the difficulties posed by the existence of different density distributions and many characteristic lengths in the data. Finally, we wish to reemphasize the aspect we view as the main advantage of our method: its generic applicability. It is likely and natural to expect that for just about any underlying distribution of data, one will be able to find a particular method, tailor-made to handle that particular distribution, whose performance will be better than that of SPC. If, however, there is no advance knowledge of this distribution, one cannot know which of the existing methods fits best and should be trusted. SPC, on the other hand, will find any lumpiness of the underlying data (if it exists), without any fine-tuning of its parameters.

Appendix: Clusters and the Potts Model

The Potts model can be mapped onto a random cluster problem (Fortuin & Kasteleyn, 1972; Coniglio & Klein, 1981; Edwards & Sokal, 1988). In this formulation, clusters are defined as connected graph components governed by a specific probability distribution. We present this alternative formulation here in order to give another motivation for the superparamagnetic method, as well as to facilitate its comparison with graph-based clustering techniques. Consider the following graph-based model whose basic entities are bond variables $n_{ij} = 0, 1$ residing on the edges $\langle i,j \rangle$ connecting neighboring sites $v_i$ and $v_j$. When $n_{ij} = 1$, the bond between sites $v_i$ and $v_j$ is "occupied," and when $n_{ij} = 0$, the bond is "vacant." Given a configuration $\mathcal{N} = \{n_{ij}\}$, random clusters are defined as the vertices of the connected components of the occupied bonds (where a vertex connected only to vacant bonds is considered a cluster containing a single point). The random cluster model is defined by the probability distribution

$$W(\mathcal{N}) = \frac{q^{C(\mathcal{N})}}{Z} \prod_{\langle i,j \rangle} p_{ij}^{n_{ij}} (1 - p_{ij})^{1 - n_{ij}}, \tag{A.1}$$
where $C(\mathcal{N})$ is the number of clusters of the given bond configuration, the partition sum Z is a normalization constant, and the parameters satisfy $0 \le p_{ij} \le 1$. The case q = 1 is the percolation model, where the joint probability of equation A.1 factorizes into a product of independent factors for each $n_{ij}$. Thus, the state of each bond is independent of the state of any other bond. This implies, for example, that the most probable state is found simply by setting $n_{ij} = 1$ if $p_{ij} > 0.5$ and $n_{ij} = 0$ otherwise. By choosing q > 1, the weight of a bond configuration $\mathcal{N}$ is no longer the product of local independent factors.
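To make the weight of equation A.1 concrete, here is a sketch (ours, not from the paper) that evaluates the unnormalized weight of a bond configuration, counting $C(\mathcal{N})$ with a union-find pass over the occupied bonds:

```python
def unnormalized_weight(n_sites, edges, n, p, q):
    """Unnormalized weight of equation A.1 for a bond configuration:
    q**C(N) * prod_e p[e]**n[e] * (1 - p[e])**(1 - n[e]).
    edges is a list of site pairs (i, j); n[e] and p[e] refer to edge e."""
    parent = list(range(n_sites))
    def find(i):                        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    w = 1.0
    for e, (i, j) in enumerate(edges):
        w *= p[e] if n[e] else (1.0 - p[e])
        if n[e]:                        # an occupied bond merges two clusters
            parent[find(i)] = find(j)
    C = len({find(i) for i in range(n_sites)})  # number of random clusters
    return q ** C * w
```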
Instead, the weight of a configuration is also influenced by the spatial distribution of the occupied bonds, since configurations with more random clusters are given a higher weight. For instance, it may happen that a bond $n_{ij}$ is likely to be vacant while a bond $n_{kl}$ is likely to be occupied, even though $p_{ij} = p_{kl}$. This can occur if the vacancy of $n_{ij}$ enhances the number of random clusters, while sites $v_k$ and $v_l$ are connected through occupied bonds other than $n_{kl}$. Surprisingly, there is a deep connection between the random cluster model and the seemingly unrelated Potts model. The basis for this connection (Edwards & Sokal, 1988) is a joint probability distribution of Potts spins and bond variables:

$$P(S, \mathcal{N}) = \frac{1}{Z} \prod_{\langle i,j \rangle} \left[ (1 - p_{ij})(1 - n_{ij}) + p_{ij}\, n_{ij}\, \delta_{s_i, s_j} \right]. \tag{A.2}$$
The marginal probability $W(\mathcal{N})$ is obtained by summing $P(S, \mathcal{N})$ over all Potts spin configurations. On the other hand, by setting

$$p_{ij} = 1 - \exp\left(-\frac{J_{ij}}{T}\right) \tag{A.3}$$
and summing $P(S, \mathcal{N})$ over all bond configurations, the marginal probability of equation 2.3 is obtained. The mapping between the Potts spin model and the random cluster model implies that the superparamagnetic clustering method can be formulated in terms of the random cluster model. One way to see this is to realize that the SW clusters are actually the random clusters: the prescription given in Section 3 for generating the SW clusters is precisely the conditional probability $P(\mathcal{N}|S) = P(S, \mathcal{N})/P(S)$. Therefore, by sampling the spin configurations obtained in the Monte Carlo sequence (according to the probability P(S)), the bond configurations obtained are generated with probability $W(\mathcal{N})$. In addition, recall that the Potts spin-spin correlation function $G_{ij}$ is measured by using equation 4.3, relying on the statistics of the SW clusters. Since the clusters are obtained through the spin-spin correlations, they can be determined directly from the random cluster model.
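A minimal sketch of one Swendsen-Wang sweep under this formulation (ours; it draws bonds from $P(\mathcal{N}|S)$ using equation A.3 and labels the resulting random clusters with a union-find pass):

```python
import numpy as np

def sw_sweep(spins, edges, J, T, q, rng):
    """One Swendsen-Wang sweep: an edge <i,j> is occupied with probability
    p_ij = 1 - exp(-J_ij / T), and only if s_i == s_j (the delta function
    in equation A.2); each resulting random cluster is then assigned one
    new spin value drawn uniformly from {0, ..., q-1}."""
    parent = list(range(len(spins)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for e, (i, j) in enumerate(edges):
        if spins[i] == spins[j] and rng.random() < 1.0 - np.exp(-J[e] / T):
            parent[find(i)] = find(j)
    new_state = {}                       # one fresh Potts state per cluster
    for i in range(len(spins)):
        root = find(i)
        if root not in new_state:
            new_state[root] = rng.integers(q)
        spins[i] = new_state[root]
    return spins
```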
One of the most salient features of the superparamagnetic method is its probabilistic approach, as opposed to the deterministic one taken in other methods. Such deterministic schemes can indeed be recovered in the zero-temperature limit of this formulation (see equations A.1 and A.3); at T = 0, only the bond configuration $\mathcal{N}_0$ corresponding to the ground state appears with nonvanishing probability. Some of the existing clustering methods can be formulated as deterministic percolation models (T = 0, q = 1). For instance, the percolation method proposed by Dekel and West (1985) is obtained by choosing the coupling between spins $J_{ij} = \theta(R - d_{ij})$; that is, the interaction between spins $s_i$ and $s_j$ is equal to one if their separation is smaller than the clustering parameter R and zero otherwise. Moreover, the single-link hierarchy (see, for example, Jain & Dubes, 1988) is obtained by varying the clustering parameter R. Clearly, in these processes the reward on the number of clusters is ruled out, and therefore only pairwise information is used. Jardine and Sibson (1971) attempted to list the essential characteristics of useful clustering methods and concluded that the single-link method was the only one that satisfied all the mathematical criteria. In practice, however, it performs poorly, because single-link clusters easily chain together and are often straggly; only a single connecting edge is needed to merge two large clusters. To some extent, the superparamagnetic method overcomes this problem by introducing a bonus on the number of clusters, which is reflected in the fact that the system prefers to break apart clusters that are connected by a small number of bonds. Fukunaga's (1990) valley-seeking method is recovered in the case q > 1 with interaction between spins $J_{ij} = \theta(R - d_{ij})$. In this case, the Hamiltonian (see equation 2.1) is just the class separability measure of this algorithm, and a Metropolis relaxation at T = 0 is used to minimize it. The relaxation process terminates at some local minimum of the energy function, and points with the same spin value are assigned to a cluster. This procedure depends strongly on the initial conditions and is likely to stop at a metastable state that does not correspond to the correct answer.
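The deterministic percolation limit (T = 0, q = 1) described above can be sketched directly (a hypothetical re-implementation, not the authors' code): thresholding the pairwise distances at R and taking connected components reproduces the Dekel-West clustering, and sweeping R upward traces out the single-link hierarchy.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def percolation_clusters(X, R):
    """Dekel-West percolation clustering: with J_ij = theta(R - d_ij),
    points closer than R are linked, and the clusters are the connected
    components of the resulting threshold graph."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adjacency = csr_matrix(d < R)        # self-loops on the diagonal are harmless
    _, labels = connected_components(adjacency, directed=False)
    return labels                        # one cluster id per point
```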
Acknowledgments

We thank I. Kanter for many useful discussions. This research has been supported by the Germany-Israel Science Foundation.

References

Ahuja, N. (1982). Dot pattern processing using Voronoi neighborhood. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-4, 336–343.
Ball, G., & Hall, D. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153–155.
Baraldi, A., & Parmiggiani, F. (1995). A neural network for unsupervised categorization of multivalued input patterns: An application to satellite image clustering. IEEE Transactions on Geoscience and Remote Sensing, 33(2), 305–316.
Baxter, R. J. (1973). Potts model at the critical temperature. Journal of Physics, C6, L445–448.
Binder, K., & Heermann, D. W. (1988). Monte Carlo simulations in statistical physics: An introduction. Berlin: Springer-Verlag.
Billoire, A., Lacaze, R., Morel, A., Gupta, S., Irbäck, A., & Petersson, B. (1991). Dynamics near a first-order phase transition with the Metropolis and Swendsen-Wang algorithms. Nuclear Physics, B358, 231–248.
Blatt, M., Wiseman, S., & Domany, E. (1996a). Super-paramagnetic clustering of data. Physical Review Letters, 76, 3251–3255.
Blatt, M., Wiseman, S., & Domany, E. (1996b). Clustering data through an analogy to the Potts model. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in Neural Information Processing Systems (Vol. 8, p. 416). Cambridge, MA: MIT Press.
Blatt, M., Wiseman, S., & Domany, E. (1996c). Method and apparatus for clustering data. U.S. patent application.
Buhmann, J. M., & Kühnel, H. (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39, 1133.
Chen, S., Ferrenberg, A. M., & Landau, D. P. (1992). Randomness-induced second-order transitions in the two-dimensional eight-state Potts model: A Monte Carlo study. Physical Review Letters, 69(8), 1213–1215.
Coniglio, A., & Klein, W. (1981). Thermal phase transitions at the percolation threshold. Physics Letters, A84, 83–84.
Cranias, L., Papageorgiou, H., & Piperidis, S. (1994). Clustering: A technique for search space reduction in example-based machine translation. In Proceedings of the 1994 IEEE International Conference on Systems, Man, and Cybernetics. Humans, Information and Technology (Vol. 1, pp. 1–6). New York: IEEE.
Dekel, A., & West, M. J. (1985). On percolation as a cosmological test. Astrophysical Journal, 288, 411–417.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley-Interscience.
Edwards, R. G., & Sokal, A. D. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithm. Physical Review D, 38, 2009–2012.
Faber, V., Hochberg, J. G., Kelly, P. M., Thomas, T. R., & White, J. M. (1994). Concept extraction, a data-mining technique. Los Alamos Science, 22, 122–149.
Fanty, M., & Cole, R. (1991). Spoken letter recognition. In R. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (Vol. 3, pp. 220–226). San Mateo, CA: Morgan-Kaufmann.
Foote, J. T., & Silverman, H. F. (1994). A model distance measure for talker clustering and identification. In Proceedings of the 1994 IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 317–320). New York: IEEE.
Fortuin, C. M., & Kasteleyn, P. W. (1972). On the random-cluster model. Physica (Utrecht), 57, 536–564.
Fowlkes, E. B., Gnanadesikan, R., & Kettering, J. R. (1988). Variable selection in clustering. Journal of Classification, 5, 205–228.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Fu, Y., & Anderson, P. W. (1986). Applications of statistical mechanics to NP-complete problems in combinatorial optimization. Journal of Physics A: Math. Gen., 19, 1605–1620.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. San Diego: Academic Press.
Gdalyahu, Y., & Weinshall, D. (1997). Local curve matching for object recognition without prior knowledge. In Proceedings of the DARPA Image Understanding Workshop, New Orleans, May 1997.
Gould, H., & Tobochnik, J. (1988). An introduction to computer simulation methods, part II. Reading, MA: Addison-Wesley.
Gould, H., & Tobochnik, J. (1989). Overcoming critical slowing down. Computers in Physics, 29, 82–86.
Hennecke, M., & Heyken, U. (1993). Critical dynamics of cluster algorithms in the dilute Ising model. Journal of Statistical Physics, 72, 829–844.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hoshen, J., & Kopelman, R. (1976). Percolation and cluster distribution. I. Cluster multiple labeling technique and critical concentration algorithm. Physical Review, B14, 3438–3445.
Iokibe, T. (1994). A method for automatic rule and membership function generation by discretionary fuzzy performance function and its application to a practical system. In R. Hall, H. Ying, I. Langari, & O. Yen (Eds.), Proceedings of the First International Joint Conference of the North American Fuzzy Information Processing Society Biannual Conference, the Industrial Fuzzy Control and Intelligent Systems Conference, and the NASA Joint Technology Workshop on Neural Networks and Fuzzy Logic (pp. 363–364). New York: IEEE.
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Englewood Cliffs, NJ: Prentice Hall.
Jardine, N., & Sibson, R. (1971). Mathematical taxonomy. New York: Wiley.
Kamata, S., Eason, R. O., & Kawaguchi, E. (1991). Classification of Landsat image data and its evaluation using a neural network approach. Transactions of the Society of Instrument and Control Engineers, 27(11), 1302–1306.
Kamata, S., Kawaguchi, E., & Niimi, M. (1995). An interactive analysis method for multidimensional images using a Hilbert curve. Systems and Computers in Japan, 27, 83–92.
Kamgar-Parsi, B., & Kanal, L. N. (1985). An improved branch and bound algorithm for computing K-nearest neighbors. Pattern Recognition Letters, 3, 7–12.
Kandel, D., & Domany, E. (1991). General cluster Monte Carlo dynamics. Physical Review, B43, 8539–8548.
Karayiannis, N. B. (1994). Maximum entropy clustering algorithms and their application in image compression. In Proceedings of the 1994 IEEE International Conference on Systems, Man, and Cybernetics. Humans, Information and Technology (Vol. 1, pp. 337–342). New York: IEEE.
Kelly, P. M., & White, J. M. (1993). Preprocessing remotely sensed data for efficient analysis and classification. In SPIE Applications of Artificial Intelligence 1993: Knowledge Based Systems in Aerospace and Industry (pp. 24–30). Washington, DC: International Society for Optical Engineering.
Kirkpatrick, S., Gelatt Jr., C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Kosaka, T., & Sagayama, S. (1994). Tree-structured speaker clustering for fast speaker adaptation. In Proceedings of the 1994 IEEE International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 245–248). New York: IEEE.
Larch, D. (1994). Genetic algorithms for terrain categorization of Landsat. In Proceedings of the SPIE—The International Society for Optical Engineering (pp. 2–6). Washington, DC: International Society for Optical Engineering.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symp. Math. Stat. Prob. (Vol. I, pp. 281–297). Berkeley: University of California Press.
Mézard, M., & Parisi, G. (1986). A replica analysis of the traveling salesman problem. Journal de Physique, 47, 1285–1286.
Miller, D., & Rose, K. (1996). Hierarchical, unsupervised learning with growing via phase transitions. Neural Computation, 8, 425–450.
Moody, J., & Darken, C. J. (1989). Fast learning in neural networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Murphy, P. M., & Aha, D. W. (1994). UCI repository of machine learning databases. http://www.ics.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.
Niedermayer, F. (1990). Improving the improved estimators in O(N) spin models. Physics Letters, B237, 473–475.
Nowlan, J. S., & Hinton, G. E. (1991). Evaluation of adaptive mixtures of competing experts. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in Neural Information Processing Systems (Vol. 3, pp. 774–780). San Mateo, CA: Morgan-Kaufmann.
Phillips, W. E., Velthuizen, R. P., Phuphanich, S., Hall, L. O., Clarke, L. P., & Silbiger, M. L. (1995). Application of fuzzy c-means segmentation technique for tissue differentiation in MR images of a hemorrhagic glioblastoma multiforme. Magnetic Resonance Imaging, 13, 277–290.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65, 945–948.
Srinivasan, A. (1994). UCI repository of machine learning databases (University of California, Irvine), maintained by P. Murphy and D. Aha.
Suzuki, M., Shibata, M., & Suto, Y. (1995). Application to the anatomy of the volume rendering–reconstruction of sequential tissue sections for rat's kidney. Medical Imaging Technology, 13, 195–201.
Swendsen, R. H. (1993). Modern methods of analyzing Monte Carlo computer simulations. Physica, A194, 53–62.
Swendsen, R. H., Wang, S., & Ferrenberg, A. M. (1992). New Monte Carlo methods for improved efficiency of computer simulations in statistical mechanics. In K. Binder (Ed.), The Monte Carlo Method in Condensed Matter Physics (pp. 75–91). Berlin: Springer-Verlag.
Wang, S., & Swendsen, R. H. (1990). Cluster Monte Carlo algorithms. Physica, A167, 565–579.
Wiseman, S., Blatt, M., & Domany, E. (1996). Unpublished manuscript.
Wolff, U. (1989). Comparison between cluster Monte Carlo algorithms in the Ising spin model. Physics Letters, B228, 379–382.
Wong, Y.-F. (1993). Clustering data by melting. Neural Computation, 5, 89–104.
Wu, F. Y. (1982). The Potts model. Reviews of Modern Physics, 54(1), 235–268.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.

Received July 26, 1996; accepted January 15, 1997.